perf: TQ4_1S native kernel 3.5× faster — 240 t/s, less VRAM than q8_0 conversion#57
Conversation
…AM than q8_0 Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel. Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time conversion (240 vs 176 t/s) while using 1.7× LESS VRAM (4.5 vs 7.5 GiB). Key optimizations (found via 86 automated experiments): 1. fp16 activation buffer — halves activation bandwidth (the bottleneck) 2. Shared-memory centroid LUT — eliminates constant memory serialization on divergent lane access (+89% single change) 3. Half2 arithmetic + strided block processing — 2× arithmetic density 4. Vectorized 128-bit loads — uint32×4 weights, int4 activations (+45%) 5. Register __byte_perm centroid decode — zero-memory centroid lookup 6. NWARPS 8→4 Also: - Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead of default. Native kernel is strictly better on both speed and VRAM. - Autoresearch harness gains coherence testing (server API + factual Q&A) to catch silent corruption that PPL alone misses. Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S): Upstream V12 runtime: 67 t/s (4.5 GiB VRAM) q8_0 conversion: 176 t/s (7.5 GiB VRAM) Native optimized: 240 t/s (4.5 GiB VRAM) ← this PR Quality (vs f16 baseline): PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55) Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078) NIAH: 5/5 Coherence: 4/4 (Paris, 4, print, Shakespeare)
TQ3_1S Regression Check — PASS
TQ3_1S uses the unchanged V8 scalar fallback path — no regression. The optimizations (shmem LUT, half2, |
Regression Test ResultsMac (M5 Max 128GB, Metal)Build: clean, zero errors. Qwen2.5-1.5B Q4_K_M:
No regressions. Metal path unaffected by CUDA changes. Windows (GTX 1080 Ti 11GB, CUDA 12.6, SM 6.1)Build: clean, zero errors. Qwen2.5-7B-Instruct:
Config I decode improvement (38.4 → 51.0 tok/s) is the native TQ4_1S kernel optimization showing up on Pascal. Prefill regression on Config I (956 → 587) needs investigation — may be related to the load-time conversion change (now opt-in). All 4 configs produce valid output. No crashes, no build warnings from the PR's files. |
Prefill regression confirmed — root cause identifiedThe prefill regression (956→587 on Pascal, 13K→6.5K on 5090) is from disabling the load-time q8_0 conversion. Native TQ4_1S prefill falls through to cuBLAS dequant-to-f16, which requires a full inverse WHT per element — much slower than the q8_0 cuBLAS path that only needs standard q8_0 dequant.
The tradeoff is clear: native wins decode (+29%), q8_0 wins prefill (+105%). Need a hybrid that uses native for decode and q8_0 for prefill — or accept the prefill cost for the VRAM savings. Options:
For now I'll switch the default back to conversion (option 1) since prefill is more commonly the bottleneck, and document the native opt-in for decode-heavy use cases. The +33% decode on Pascal and +29% on 5090 is real and valuable for interactive chat with long context. |
|
Testing on my AMD now for posterity, Asking a community member to verify with his CUDA. |
good analysis on the prefill tradeoff. option 1 is the right call. defaulting to conversion keeps existing behavior, and documenting native opt-in for decode-heavy workloads gives users the choice without surprising anyone. found a compile issue on AMD/ROCm will patch with HIP guards. wanted to flag it before merge. +33% decode on pascal is a great result. the autoresearch harness is proving its worth. |
Replace per-element generic dequant template (which repeats the full 32-element WHT butterfly 16 times per block) with a warp-cooperative version using __shfl_xor_sync. One WHT per block instead of 16. Note: this improves the dequant kernel itself but doesn't fix the prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs the q8_0 conversion path's native int8 tensor core GEMM. The dequant was never the slow part — the GEMM dispatch is fundamentally different. For prefill-heavy workloads, load-time q8_0 conversion remains the recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy interactive chat where the +29% decode speed matters more.
|
heads up from a community tester — turbo-innerq.cuh has a build issue on static builds (BUILD_SHARED_LIBS=OFF). The TURBO_IQ_API macro assumes GGML_BACKEND_BUILD is defined, but it's only set for shared builds. Fix: wrap the block in #ifdef GGML_BACKEND_SHARED matching the pattern from ggml-backend.h. For static builds TURBO_IQ_API should be empty (no dllexport/dllimport). |
|
Static build fix + multi-token kernel update Fixed the Also added a multi-token TQ4_1S dp4a kernel for Default remains load-time q8_0 conversion (option 1, as agreed).
Re: HIP — happy to coordinate on the guards once your patch is ready. |
Community testing: dual RTX 4090 resultsTested on dual 4090s with layer split. Found a multi-GPU bug and fix. Results
Multi-GPU bugNative TQ4_1S kernel crashes during decode on multi-GPU (dual 4090s). Prefill works fine (goes through cuBLAS mat_mul). Decode hits the native kernel which uses global static CUDA device pointers (
q8_0 conversion path works for both prefill and decode because it uses standard q8_0 mul_mat_vec. FixReplace static global buffers with per-device pool allocations from
This makes the buffers per-device and per-call, fixing multi-GPU decode. |
|
Multi-GPU fix applied Replaced static global No performance regression on single GPU (pp512=5,474, tg128=225 with |
- Multi-token dp4a kernel for ne[1]≤8 (speculative decoding, small batches) loads weight data once per block, reuses across all ncols_dst tokens - Runtime TQ4_1S→fp16 dequant + cuBLAS for ne[1]>8 prefill - Fix multi-GPU crash: replace static global CUDA buffers with per-device pool allocations from ctx.pool(id), matching mmvq.cu pattern - Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
AMD HIP/ROCm testing: RX 9070 XT (RDNA 4, gfx1201)Tested by community member on Windows 11, HIP SDK 7.1, Qwen2.5-1.5B-Instruct. Build fix required
- const int s0 = __dp4a(c0, a0, 0);
+ const int s0 = ggml_cuda_dp4a(c0, a0, 0);Regression results (Qwen2.5-1.5B, RX 9070 XT)
PPL: No regression (+1.8% vs Q8_0). Regression vs previous kernel (PR #45 V12)
PR 57's dp4a path is optimized for NVIDIA tensor cores (Ampere+/Blackwell) where it achieves 240 t/s (+3.5x). On RDNA 4, the May need an architecture-specific fallback to V12 for AMD GPUs, or an |
|
Fixed — replaced all 8 Re: RDNA4 regression — the dp4a/int8 path was designed around NVIDIA's dp4a throughput (Turing+). On RDNA4,
Option 1 is simpler and probably sufficient. Happy to add the AMD fallback path if you want it in this PR, or we can split it out. |
RDNA4 arch dispatch fix — cherry-pick readyBranch: What it doesRoutes AMD GPUs to a scalar half-precision TQ4_1S kernel instead of the dp4a int8 path. NVIDIA continues using dp4a.
Expected RDNA4 resultRestore V12-level decode (~135 t/s, 130% of Q8_0) instead of dp4a regression (101 t/s, 60% of Q8_0). Verified
Cherry-pickgit fetch https://github.com/TheTom/llama-cpp-turboquant.git fix/rdna4-arch-dispatch
git cherry-pick 96e211d |
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput (240 t/s on 5090). On RDNA4, sudot4 has different throughput characteristics and the q8_1 activation quantization adds overhead, causing a regression vs the V12 float kernel (101 vs 135 t/s on RX 9070 XT). Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD GPUs to a scalar half-precision kernel (same pattern as TQ3_1S). NVIDIA continues using the dp4a path. Changes: - Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations, shmem centroid LUT, scalar dot product (no dp4a/byte_perm) - Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path. - LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0) instead of dp4a regression (101 t/s, 60% of Q8_0). Co-Authored-By: Tom Turney <tturney1@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Cherry-picked Clean approach with the LAUNCH_SCALAR macro. Looking forward to the RDNA4 retest numbers. |
|
Thanks I'll test it out on my local PC. |
RDNA4 retest — RX 9070 XT (gfx1201, HIP SDK 7.1) — ALL PASSWith the arch dispatch fix (commit 96e211d cherry-picked by @signalnine): Speed (Qwen2.5-1.5B-Instruct, r=10)
Before vs after fix
PPL (wikitext-2-raw, 512 context, 20 chunks)
PPL: PASS — no regression. Output qualityPrompt: "What is the capital of France? Answer briefly." PASS — coherent, correct. Note for AMD testersRDNA 4 has severe cold-start JIT variance. With |
a4e8af4
into
TheTom:feature/turboquant-kv-cache
Ampere (RTX 3080 Ti, SM 86) — native kernel parity with q8_0 convertTested this PR ( Setup
Results
Delta: +0.9% pp512, 0% tg128 — both within run-to-run variance. InterpretationOn 5090 you saw native beat convert 240 → 176 t/s (+36%). On 3080 Ti they're identical. The PR's kernel optimizations (fp16 activation buffer, SMEM centroid LUT, half2 strided blocks, vectorized 128-bit loads, Not a regression — native is at least as fast as convert on Ampere, and the lower-VRAM advantage still holds. Just not the headline 3.5x. The case for merging still stands on Ampere: equal speed, lower memory, simpler code path. Cross-datapoint on the same build, Qwen3.5-9B TQ4_1S (48 layers, 8:1 GQA): convert actually edges native by ~5% pp512 (3776 vs 3592) — consistent with "native optimizations don't help on Ampere", and may hint that deeper models expose some overhead in the native path specifically on SM 86. Worth a look if you want to chase it. Happy to run additional configs (ctx sweep, different models) if useful. |
|
@seanrasch thanks for the Ampere validation, that was the last unchecked box on the test plan. really appreciate the extra eyes. good to confirm no regression on SM 86. the parity result makes sense — the kernel wins (shmem LUT, half2, vectorized loads) are latency optimizations that only show up when you're not bandwidth-bound, which Ampere is at these model sizes. the 5% prefill gap on deeper models is a useful data point. lines up with what we'd expect from the per-element inverse WHT overhead in the native dequant path. default is conversion-on for prefill so it shouldn't bite anyone in practice. thanks again for taking the time to test and write it up. |
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(§5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push.
Creates GitHub release with prebuilt binaries.
vulkan: TQ4_1s support for model weights (#69)
* vulkan: add TQ4_1S weight compression support
Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight
compression with 16 Lloyd-Max centroids, 32-element blocks).
Shaders:
- dequant_tq4_1s.comp: standalone dequant with WHT inverse via
subgroupShuffleXor (32-thread workgroup, 5-stage butterfly)
- mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline
activation pre-rotation (forward RHT on activation, centroid*scale
dequant without inverse RHT)
- copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse
- copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward
RHT, dual half-block RMS scales, 16-centroid quantization
- types.glsl: block_tq4_1s struct (d0, d1, qs[16])
- dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT)
Pipeline wiring (ggml-vulkan.cpp):
- MUL_MAT, SET_ROWS, CPY supports_op
- pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32
- Specialized MUL_MAT_VEC with forced subgroup workgroup size
Tests:
- test_set_rows_tq4_1s: SET_ROWS round-trip validation
* vulkan: add fused mul_mat_vec kernel for TQ4_1S
Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the
per-decode-step matrix-vector product no longer has to dequant the
full weight tensor to f16 and then go through the generic matmul
path. The kernel pre-rotates the activation via a forward
Walsh-Hadamard Transform in shared memory and dot-products against
the raw centroid*scale stored weights, folding the inverse-WHT on
the weight side into the activation by the symmetry H = H^T.
Math:
w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k]
sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j]
Portability choices:
- Workgroup size is pinned to 32 threads regardless of the
DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for
the current architecture. The butterfly operates on 32-element
blocks with one element per thread; that contract is fixed by the
quantization format, not by the GPU. Earlier revisions used
`gl_WorkGroupSize.x` as the stride unit, which silently skipped
half the work on Intel drivers that force the subgroup to 16
(tests passed via NMSE tolerance while real inference output was
garbage).
- Butterfly implementation is shared memory only. A subgroup-shuffle
variant (`subgroupShuffleXor`) was prototyped and measured on Intel
Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the
explicit shared-memory butterfly, because Mesa emulates subgroup
shuffles via LDS and ends up doing the same LDS traffic with extra
driver overhead. The shared-memory butterfly is correct on every
device regardless of subgroup-op support, is the fastest path on
every device we can actually measure, and leaves the
`pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform
across all DMMV_WG_SIZE buckets.
- Reduction is the shared-memory tree reduction (no subgroupAdd), for
the same reason: on Intel Arc the subgroupAdd is also LDS-backed
and the hybrid reduction path was measurably slower. Future
vendor-specific heuristics can switch to the hybrid or pure-subgroup
reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops
turn out to beat the LDS roundtrip there; the existing reduction
modes in `mul_mat_vec_base.glsl` already provide the necessary
variants.
- NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows
per workgroup. Each thread holds one position of each of the 8
weight blocks and pairs them with the shared rotated activation.
- `mul_mm` and `flash_attn_cm2` shader generation is skipped for
TQ4_1S because it is a weight-only format that never reaches the
coopmat2 matmul or the KV cache flash-attention paths.
Tests:
- `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01
NMSE so real defects can't hide behind a loose check.
- Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage
(k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536,
2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test
cases total.
Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG,
`llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512
--chunks 20 wiki.test.raw`):
| Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 |
|---------------|---------|----------:|----------:|-------:|---------:|---------:|
| Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% |
| Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% |
| Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% |
| Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% |
Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I:
the compressed model fits in less VRAM, and on a small model the
TQ4_1S compute cost is offset by the reduced memory traffic.
All four models produce coherent output end-to-end and the
reductions line up with the TurboQuant paper's validation matrix
(┬з5.8). The remaining gap to Q8_0 on the bigger models is
compute-bound on the A380; it closes further on GPUs with more raw
throughput.
* vulkan: restructure TQ4_1S inner loop for cross-row smem reuse
Splits the dequant+accumulate phase into two sub-loops:
1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup +
scale multiply, reads from weight buffer only).
2. Read the rotated activation from shared memory ONCE per column,
then FMA across all rows in a tight register loop.
This is the Vulkan analogue of the 'hot loop load dedup' from the
CUDA kernel (PR #57 optimisation #2). It makes the shared memory
read explicitly loop-invariant across rows, which helps compilers
that don't auto-hoist LDS loads out of unrolled loops.
Measured effect on Intel Arc A380 (Llama-3.2-3B premium,
llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise
but not a regression). The structure is cleaner regardless and
should benefit architectures with higher LDS latency.
fix: inverse WHT in test-turbo-quant.c round-trip (#59)
fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10
The TurboFlash two-pass fused attention kernel produces garbage output
on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by
default routes turbo3 through the standard FA path which works correctly.
Users can opt-in with TURBO_FLASH=1 for testing/debugging.
No perf regression тАФ standard FA path matches TurboFlash speed within
noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: gate turbo V unpad on V type, not K type
fix(hip): bypass pool for FA f16 temp buffers to prevent OOM
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations
permanently. For quantized KV flash attention, the f16 dequant temp buffers
stay allocated in the pool after use, consuming more VRAM than the KV
compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4)
to OOM before f16 at equivalent context lengths on HIP/ROCm.
Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with
hipFree (via RAII destructor) instead of the pool. Memory is released
after the FA kernel completes via hipStreamSynchronize.
Compared to commit 1 (VEC force), this approach recovers prefill:
pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169%
pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203%
Decode: unchanged across all configs
Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k).
Can be eliminated in future with hipFreeAsync (stream-ordered free).
Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never
calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory.
On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM.
Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1)
Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP)
Refs: ggml-org/llama.cpp#20969
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: force VEC FA path for quantized KV on HIP/ROCm
The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers
proportional to full KV cache length for any quantized KV type.
On ROCm/HIP these pool allocations persist at peak size, meaning
the temp buffer VRAM exceeds the savings from KV compression.
This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM
before f16 at the same context length. Confirmed on gfx1100
(RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock
llama.cpp q4_0/q8_0 (not TurboQuant-specific).
Fix: on HIP, force VEC path for quantized KV when available
(head_dim <= 256). VEC does inline dequant with no temp buffer.
Trade-off: prefill throughput may decrease (VEC processes queries
sequentially). Decode is unaffected since VEC was already selected
for single-token generation.
Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512)
cannot use VEC and still routes through TILE. Bounded temp buffer
in a separate compilation unit is the proper fix for those cases
(see domvox/llama.cpp-turboquant-hip a13c3db12 discussion).
Refs: ggml-org/llama.cpp#20969 (community reports)
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vulkan: fix turbo3 build + coopmat FA after April upstream sync
Origin's April upstream-sync rebase interleaved two changes that left the
Vulkan turbo3 KV path broken:
* ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE
rounding to a runtime SPIR-V patch and dropped the _rte shader
variants plus rte.glsl itself.
* TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV
support against a base that still had those variants.
After the rebase, the tree had dangling cpy_f32_*_rte_len / _data
references, a two-arg SET_ROWS macro called with one arg, a
MMQ shader variants generated for turbo3_0 even though the flash_attn
MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp
failed to compile on a clean checkout (spirv-headers + all of the
above) and the shader-gen emitted garbage variants.
Separately, turbo3 flash-attn pipelines were only wired up for
FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the
tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in
ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0})
and tripped the Br == wg_denoms[0] assertion as soon as a prefill
ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3
aborted on the first real forward pass.
Changes:
* Drop the if (float_controls_rte_fp16) / else branches around
cpy_f32_quant pipeline creation and collapse SET_ROWS to a single
variant, matching upstream post-1f30ac0ce.
* Remove the #include "rte.glsl" from copy_to_quant.comp.
* Skip the MMQ flash_attn shader variant for turbo3_0 in the shader
generator (no MMQ code path for it).
* Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1)
and the _cm2 counterpart alongside the other quant types.
Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan
1.4.341, spirv-headers 1.4.341.0):
* Full HIP+Vulkan build is clean with no shader compile errors.
* test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147
* test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 :
530 cases pass (previously aborted on case 3).
* test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 :
still green (no HIP regression).
* llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3
-ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128
correctness issue on the Vulkan turbo3 decode path is pre-existing
and orthogonal to this change.
llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend:
F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17
Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81
fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks
fix nix build: Add spirv-headers to vulkanBuildInputs
ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81
spirv-headers were introduced by
https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a
but not added to the nix build environment
fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu
The flash attention vector dispatcher was missing instance files, extern
declarations, and dispatch cases for F16-K + TURBO-V combinations. Any
of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and
sm_121 because no dispatch case matched.
The kernel already treats TURBO types as unquantized alongside F16 and
BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a
kernel limitation. Pure additive change.
CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the
default) also link the new template instances.
Validated on:
- RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON
- DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF
Key results:
- Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89)
- Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89
- sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline)
- sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations
- Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error)
- KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline
Fixes #83
fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82)
The FA dispatcher rejected any K != V type combo unless all types were
in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0`
fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON.
The vec template instances for f16/bf16 + q8_0 are already compiled
(fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the
dispatcher was gating kernels that do exist.
Extend the predicate to include f16 and bf16 alongside turbo + q8_0.
Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where
`-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback.
Co-Authored-By: tturney@psyguard.ai
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add TURBO2_0 to flash_attn auto-enable check
turbo2 V cache failed with "failed to create context" because the
auto-enable predicate only listed turbo3/turbo4. Without auto-enable,
the subsequent quantized-V-requires-FA check hard-fails.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ci: fix turbo build and test failures
Four changes that together make the TheTom/llama-cpp-turboquant CI green
on top of feature/turboquant-kv-cache:
1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp
The abort branch's -Werror=switch was missing the three TurboQuant KV
cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build
after the turbo types were added. Added them next to the TQ* cases so
the fall-through to GGML_ABORT stays consistent.
2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert
GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC
static_assert was not updated. Bumped the patch version 1->2 and the
assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build.
3. tests/test-quantize-fns.cpp
a) Heap-buffer-overflow (ASan trace): the scratch buffers
std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when
vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants),
because from_float then writes test_size*sizeof(float) bytes into
them. Size buffers to std::max(2*test_size, test_size*sizeof(float))
so every quant's from_float fits.
b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their
dequantize_row_turbo*_0 intentionally stays in the WHT-rotated
domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT
in the attention graph, so a round-trip against the original float
buffer is not meaningful and the existing threshold table does not
apply.
c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses
MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses
MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to
3-bit error budgets instead of the default 4-bit ones.
4. ggml-cuda/turbo-quant.cuh
Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize
calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]])
no longer trips -Werror on unused-result.
Verified locally on Arch Linux (CPU build):
ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs"
=> 100% tests passed, 0 tests failed out of 41
(the excluded tokenizer test needs the model-fetch fixture)
test-quantize-fns runs clean under AddressSanitizer as well.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix GGML_OP_COUNT assertion for RPC
When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97.
Fix memory explosion on Apple Silicon
Add GitHub Sponsors funding link
vulkan: add turbo3 backend tests
- test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5
(f32 SIMD reduction order varies across GPU backends).
- test_turbo_wht_roundtrip: forward then inverse recovers original, 9
configs. NMSE tolerance 1e-5.
- test_set_rows_turbo3: full quantization round-trip at small and large
tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs.
- Existing: test_turbo_wht (18), FA with turbo3 KV (528).
- Total: 576 tests.
vulkan: fix and complete turbo3 KV cache support
feat: add CDNA4 (gfx950/MI355X) support + test results
Add AMD Instinct MI355X (gfx950) architecture support:
Code changes:
- vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family
- common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro
- mma.cuh: Route CDNA4 to compatible MFMA instructions
* bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3)
* int8: mfma_i32_16x16x32_i8 (same as CDNA3)
* f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950)
- mmq.cuh: Include CDNA4 in stream-k dispatch
- common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn)
MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU):
- turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%)
- turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%)
- WHT roundtrip: PASS (max error 2.98e-07)
Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported).
TurboQuant types force FA and work correctly.
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
docs: add AMD Instinct MI300X (gfx942) ROCm test results
TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs
correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes
required тАФ existing CUDA kernels compile via HIP translation.
Test results (Qwen2.5-1.5B Q4_K_M, single MI300X):
- WHT roundtrip: PASS (max error 2.98e-07)
- turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s)
- turbo3 decode: 88% of f16 (160 vs 181 tok/s)
- turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s)
- turbo4 decode: 89% of f16 (161 vs 181 tok/s)
MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's
MMQ kernel dispatch (upstream issue, not TurboQuant-specific).
Tested-by: Andy Luo <andyluo7@users.noreply.github.com>
metal: add TurboFlash attention kernel for turbo3 KV cache decode
Two-pass block-parallel attention kernel optimized for turbo3 V cache
decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and
turbo3-K (symmetric) configurations via compile-time function constant.
Architecture:
- Pass 1: 32-thread SIMD group per (query-head, block) pair
- Each lane handles DK/32 interleaved dimensions
- Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path
- K scoring via simd_sum dot product
- turbo3 V unpack with register codebook (8 centroids)
- Online softmax (m/l/o state) entirely in registers
- Zero shared memory in pass 1
- Pass 2: merge partial results across blocks
- Online softmax correction with global max/sum
- Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6)
- Eliminates 5 of 7 threadgroup barriers vs naive butterfly
Auto-detection: activates for single-token decode (ne01==1) when V is
turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var
(0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon).
Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V):
- M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0
- M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s)
- Advantage scales with context (+7.3% at 32K)
Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Vulkan compute shader support for turbo3 KV cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:
- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch
Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs
The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput
(240 t/s on 5090). On RDNA4, sudot4 has different throughput
characteristics and the q8_1 activation quantization adds overhead,
causing a regression vs the V12 float kernel (101 vs 135 t/s on
RX 9070 XT).
Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD
GPUs to a scalar half-precision kernel (same pattern as TQ3_1S).
NVIDIA continues using the dp4a path.
Changes:
- Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations,
shmem centroid LUT, scalar dot product (no dp4a/byte_perm)
- Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path.
- LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch
Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0)
instead of dp4a regression (101 t/s, 60% of Q8_0).
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility
feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix
- Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches)
loads weight data once per block, reuses across all ncols_dst tokens
- Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill
- Fix multi-GPU crash: replace static global CUDA buffers with per-device
pool allocations from ctx.pool(id), matching mmvq.cu pattern
- Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED
perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block)
Replace per-element generic dequant template (which repeats the full
32-element WHT butterfly 16 times per block) with a warp-cooperative
version using __shfl_xor_sync. One WHT per block instead of 16.
Note: this improves the dequant kernel itself but doesn't fix the
prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs
the q8_0 conversion path's native int8 tensor core GEMM. The dequant
was never the slow part тАФ the GEMM dispatch is fundamentally different.
For prefill-heavy workloads, load-time q8_0 conversion remains the
recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy
interactive chat where the +29% decode speed matters more.
perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0
Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel.
Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time
conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB).
Key optimizations (found via 86 automated experiments):
1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck)
2. Shared-memory centroid LUT тАФ eliminates constant memory serialization
on divergent lane access (+89% single change)
3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density
4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%)
5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup
6. NWARPS 8тЖТ4
Also:
- Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead
of default. Native kernel is strictly better on both speed and VRAM.
- Autoresearch harness gains coherence testing (server API + factual Q&A)
to catch silent corruption that PPL alone misses.
Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S):
Upstream V12 runtime: 67 t/s (4.5 GiB VRAM)
q8_0 conversion: 176 t/s (7.5 GiB VRAM)
Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR
Quality (vs f16 baseline):
PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55)
Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078)
NIAH: 5/5
Coherence: 4/4 (Paris, 4, print, Shakespeare)
fix: cap map0 kernel shmem for 256-expert MoE models
Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB),
exceeding the 32KB threadgroup memory limit on Apple Silicon.
At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem
= 256*8*2 = 4KB, well within limits.
Cap the reservation shmem to 32KB to prevent the assert from firing.
Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash
attention тАФ previously crashed during warmup, now runs at 22 t/s.
Fixes #58
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: add MoE expert count kernel instantiations + TQ4_1S backend tests
Metal MoE support:
- Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256
- Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4,
Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B
- Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server
with flash attention тАФ needs chunked map0 dispatch (follow-up)
Backend tests:
- Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops
- Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size
extern "C" GGML_API creates double extern on paths where GGML_API
expands to 'extern'. Wrap in extern "C" {} block instead.
Reported by Madreag on RTX 5090 WSL2.
Co-Authored-By: Tom Turney <tturney1@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES.
Enhance Metal operations for TQ weights and concurrency handling for Gemma 4
- Introduced a new function to check if TQ source tensors mutate during operations.
- Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior.
- Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters.
- Enhanced comments for clarity on concurrency issues related to TQ weights.
fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error)
GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the
GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang
accept it silently.
Reported independently by joemc1470 (RX 6600 HIP) and
christopheraleman1015 (WSL2 Ubuntu).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed
Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are
converted to q8_0 at model load via fused CUDA kernel. Transparent to
downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed.
signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0.
PPL 7.608 vs 7.599 (0.009 requantization rounding).
Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types).
Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max).
Co-Authored-By: Gabe Ortiz <gabe@signalnine.net>
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Fix turbo4 C reference WHT dequant mismatch (#43)
Fix/turbo4 wht dequant
fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models)
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.
Affects CPU-only inference fallback path. GPU users unaffected.
Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.
Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support)
Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).
Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.
Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: AMD HIP/ROCm build support for TQ4_1S weight compression
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows
No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.
Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: Windows MSVC build compatibility for TQ weight types
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
to fix C vs C++ name mangling across DLL boundary
All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
C2065: 'M_PI': undeclared identifier
LNK2001: unresolved external symbol turbo3_cpu_wht_group_size
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch
Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).
Key changes:
- No global scratch buffer (eliminates CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)
This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync
inside the dot product loop. V12's sync is between the two phases.
Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).
NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.
Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.
Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale
only, zero WHT per block).
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
Decode: 20.3 тЖТ 69 t/s (3.4x speedup)
vs q8_0 (177 t/s): 39%
PPL: 8.82 (identical)
Comprehensive optimization log (13 versions tested):
cuBLAS baseline: 20 t/s
V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
V3 shmem activation cache: 33 t/s (syncthreads kills it)
V5 multi-warp per row: 62 t/s
V6 LUT (shmem): 37 t/s (sync overhead)
V7 8 warps clean: 62 t/s
V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x)
V9 pre-rot + q8_1: 70 t/s (marginal)
V10 4-elem/thread: 57 t/s
V13 8-elem, 4-thread dot: 45 t/s
NR0=2/4/8: all regressed (register spill / cache thrash)
The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing тАФ TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration
- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types
Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine)
fix: disable upstream attn rotation by default (conflicts with TurboQuant)
Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.
Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add post-unrotate memory barrier for in-layer mixing safety
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.
The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW
Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative
refinement (6 iter) тЖТ pack indices.
Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch
Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2
Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL тАФ needs investigation
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove unused CENTROIDS_1BIT constant
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add missing TurboQuant FA template instances for HIP/ROCm build
The HIP build was missing 9 turbo cross-type flash attention vec
instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.
Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
since those instance files are already excluded from the HIP build
(they exceed HIP's 65536-byte local memory limit).
Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.
Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2)
The block_size=128 change (adac2c68d) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.
Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Increase turbo3/turbo2 block size from 32 to 128
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels
Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).
HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
to support both 3-arg and 4-arg calls (CUDA allows defaulting width
to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
on wave64 platforms, turbo code expects 32-bit)
HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
as CUDA CMakeLists, were missing from HIP build)
Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298bd).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: KV state serialization uses padded tensor width (issue #28 follow-up)
state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.
Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: Boundary V (experimental) тАФ layer-aware V compression
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28)
The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the
KV cache zero-padding code could run.
Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.
Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1
Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".
Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).
Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
(turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4
PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2)
Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel
instances, causing ~11x prefill regression (falling back to CPU FA).
Added VEC template instances for both cross-type pairs at D=64/128/256.
Updated the mixed-type guard in get_best_fattn_kernel to allow any
combination of turbo2, turbo3, and q8_0.
Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC
speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s)
GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT.
D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s.
Added MMA flash attention configs for D=640 (DKQ=640, DV=512),
using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256).
ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory
(ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block
limit. ncols1тЙд2 keeps total shared memory under 80KB.
Changes:
- fattn-mma-f16.cuh: D=640 config entries + extern declarations
(ncols2=16 only, ncols1тИИ{1,2,4})
- fattn.cu: case 640 in kernel selection + simplified MMA dispatch
(always ncols2=16, ncols1 capped at 2)
- fattn-tile.cuh/cu: D=640 tile configs + dispatch
- 3 MMA template instances + 1 tile instance
Tested:
GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16)
Qwen3.5: 222 t/s, PPL 6.31 (unchanged)
DeepSeek: 143 t/s, PPL 11.61 (unchanged)
16/16 coherence, NIAH 9-10/11 at 32K
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression)
head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations
Add full asymmetric K/V quantization support for Metal flash attention:
- Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total),
eliminating underscore ambiguity in type names
- 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations)
- 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims)
- Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs
- Zero regression on existing symmetric paths (validated across 4 models, 2 machines)
The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V
configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables
the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on
low-bit models where symmetric turbo-K degrades.
Validated on Metal (M2 Pro + M5 Max):
- phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression)
- Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued)
- Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression)
- Cross-hardware M2/M5 parity confirmed on all tested configs
Closes #27
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: asymmetric K/V quant support for Metal flash attention
- Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4)
- Both vec and non-vec FA paths, all supported head dims
- Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid
underscore ambiguity in type names
- Relax symmetric K/V assertion to allow turbo mixed pairs
- Update pipeline name construction in both FA pipeline getters
- 90 new kernel instantiations (54 non-vec + 36 vec)
- All 335 FA kernels use unified k{type}_v{type} naming scheme
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression)
4 centroids, no signs byte. Simpler than turbo3.
PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B.
Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel,
FA template instantiations (non-vec nl=2, vec nl=8).
Runtime: supports_op, vec/non-vec gating, llama-bench parser.
Codex-reviewed: no bugs found, only known limitation is
non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback).
Closes #19
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads)
causes catastrophic quality loss on some models:
- DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline
- GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%)
The 64-group WHT passed NIAH/coherence but the weaker decorrelation
(6 stages, 3 groups per 192-dim head) doesn't preserve statistical
quality. Straight PolarQuant (no WHT) is even worse.
All turbo types now require head_dim divisible by 128. Models with
non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to
q8_0 with a warning. The 64-group WHT code remains for future
investigation but is not used.
PPL summary (Qwen3.5, head_dim=128, wikitext-2):
f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel
In the VEC flash-attention kernel, skip V dequantization for KV
positions where ALL attention weights are below 1e-6. At long context
most positions contribute noise, not signal тАФ skipping their V dequant
saves compute with zero quality impact.
Both half2 and f32 paths are covered. The threshold (1e-6) matches
the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below
this are flushed to zero.
Benefit scales with context length тАФ at short context nearly every
position contributes so no skip occurs. At 512K+ most positions are
skipped during generation.
Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence.
Generation at 512K: 25 t/s, prefill: 2535 t/s.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: InnerQ per-channel equalization + turbo2 64-group fallback
InnerQ equalizes K channel variances before WHT rotation to reduce
quantization error on models with anisotropic K distributions.
Enabled via TURBO_INNERQ=N env var (N = calibration token count).
Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ
apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and
V output via ggml_turbo_wht op (scale passed as src[1]).
Math: <Q/s, s*K> = <Q, K> preserves dot products.
Key design:
- d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never
multiplies by uninitialized scales
- Auto-disables for 64-group models (group_size < 128) where
channel mapping is not 1:1 across WHT groups
- Auto-disables when max channel ratio < 1.2 (already balanced)
- Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv
tensor updates between SET_ROWS and graph-side WHT
- Scale_inv tensor stored in KV cache, initialized to identity
Also: turbo2 (2-bit) now requires head_dim divisible by 128.
At 64-group WHT, the 4-centroid codebook doesn't have enough
resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2
auto-falls back to q8_0 for non-128-aligned heads with a warning.
Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active
on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7
The mixed KV types commit changed the inverse WHT guard from checking
k->type to v->type. For MLA, V is a view of K's first 512 elements
(v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with
64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch
produced garbage output on GLM-4.7 Flash.
Fix: derive group_size from K when K is turbo (since K determines the
rotation used during quantization), from V only when K is not turbo.
Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression)
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value,
6.4x compression vs f16. Same architecture as turbo3 but with a
4-centroid Lloyd-Max codebook instead of 8.
Block format: block_turbo2_0 = 10 bytes per 32 values
- norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm)
- qs (8B): 2-bit centroid indices, 4 per byte
Usage: -ctk turbo2 -ctv turbo2
Full stack:
- ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct
- ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize
- turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant
- set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT)
- dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion
- fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2
- fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances
- Mixed types: turbo2/q8_0 cross-type FA instances
- common/arg.cpp: CLI --cache-type-k turbo2
- llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration
NIAH (Qwen3.5 35B, RTX 5090, 4K-512K):
4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11
Coherence: 31/31 across 4 models x all KV combos (GPU + CPU)
Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs)
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and
zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced
garbage output.
Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT
rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ
pack qs/signs тЖТ corrected norm (grp_norm / recon_norm).
WHT group size (128 or 64) is communicated from the CPU SET_ROWS
handler via a global variable, read from the op_params that
llama-kv-cache.cpp sets based on head_dim. This handles models with
different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128).
Tested CPU-only (-ngl 0):
- Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s
- DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s
- GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa)
Users can now mix turbo3 and q8_0 independently for K and V caches:
-ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality
-ctk q8_0 -ctv turbo3 # opposite tradeoff
Changes:
- New VEC FA template instances for turbo3/q8_0 cross-type pairs
(fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu)
- fattn-vec.cuh: extern declarations for mixed instances
- fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard
- llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ
when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse
WHT must not fire. For MLA, V is a view of K so v->type correctly
reflects K's turbo type.
Tested:
- 16/16 coherence tests pass (4 models ├Ч 4 KV combos)
- NIAH at 4K+32K: all combos within baseline variance (10-11/11)
q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11,
turbo3/q8_0: 9/11 (1 extra miss, normal variance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
feat: 64-element WHT groups + MLA Q rotation fix (issue #13)
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix тАФ missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash,
DeepSeek2 MLA with head_dim=192/576) previously crashed with
GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention.
Changes:
- llama-kv-cache.cpp: detect incompatible head_dim at KV cache init,
log a warning, and auto-fall back to q8_0 instead of crashing
- ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when
ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0
- fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D
- ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float)
so CPU flash attention can handle turbo3 K/V for models where CUDA FA
returns NONE (e.g. D=192 which has no CUDA FA kernel for any type)
- set-rows.cu: add tail kernel for non-128 remainder (future use)
- turbo-wht.cu: head-dim-aware processing with tail pass-through
- ops.cpp: CPU WHT head-dim-aware with tail identity copy
Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3
and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128)
continues to use turbo3 at full speed (223 t/s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13)
Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g.
GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed
with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu.
Fix:
- supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0,
so llama.cpp falls back to an unquantised KV path instead of asserting
- get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when
K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256})
Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned
heads silently use f16 KV while 128-aligned heads continue to use turbo3.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by
`nthreads` (8), but the Q register file is loaded in blocks of
`nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by
the f16/bf16 VEC kernels. This caused thread t>0 to pair K element
(16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete
index mismatch. Every generated token had garbage attention scores.
Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an
inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as
Q_v[k_KQ_0/nthreads + k_KQ_1].
Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp
that documented the symptom of this bug.
Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M:
- PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target)
- NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel тАФ 1 block per 128-element group,
128 threads per block (one thread per element):
тАв Shared-memory WHT butterfly (7 stages, no atomics)
тАв Warp-reduce L2 norm + inter-warp accumulate via smem
тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block тАФ load qs/signs once instead of twice per pair
тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75├Ч q8_0)
After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill
Remove the 5-line early-return that forced turbo3 onto the VEC flash
attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1)
via the existing dispatch logic, but prefill now goes through the Turing
MMA kernel (RTX 5090 is SM 12.0 >> 7.5).
launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are
set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and
ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original
CUDA port commit тАФ provide that conversion automatically. Stride
recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly
for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ
Before (VEC only): After (MMA for prefill):
2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0)
8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0)
32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0)
Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes).
Decode unchanged (VEC, ~0.64-0.82…
Summary
Autoresearch-discovered optimizations make the native TQ4_1S weight kernel 3.5× faster (68→240 t/s) and 36% faster than the q8_0 load-time conversion (176 t/s) while using 1.7× less VRAM (4.5 vs 7.5 GiB).
The load-time conversion is now opt-in (
GGML_TQ_CONVERT_Q8=1). The native kernel is strictly better on both speed and VRAM.Benchmarks (RTX 5090, Qwen2.5-7B TQ4_1S)
Speed
Quality (vs f16 baseline PPL 7.18)
Key optimizations (86 automated experiments, 15 kept)
__byte_permcentroid decode — zero-memory centroid lookupAlso includes
Test plan