perf: TQ4_1S native kernel 3.5× faster — 240 t/s, less VRAM than q8_0 conversion by signalnine · Pull Request #57 · TheTom/llama-cpp-turboquant

signalnine · 2026-04-06T15:10:45Z

Summary

Autoresearch-discovered optimizations make the native TQ4_1S weight kernel 3.5× faster (68→240 t/s) and 36% faster than the q8_0 load-time conversion (176 t/s) while using 1.7× less VRAM (4.5 vs 7.5 GiB).

The load-time conversion is now opt-in (GGML_TQ_CONVERT_Q8=1). The native kernel is strictly better on both speed and VRAM.

Benchmarks (RTX 5090, Qwen2.5-7B TQ4_1S)

Speed

Path	t/s	VRAM
Upstream V12 runtime	67	4.5 GiB
q8_0 load-time conversion	176	7.5 GiB
Native optimized (this PR)	240	4.5 GiB

Quality (vs f16 baseline PPL 7.18)

Metric	Native TQ4_1S	q8_0 conversion	q4_0
PPL	7.54	7.55	7.85
Mean KLD	0.056	0.057	0.078
Coherence	4/4 ✅	✅	✅
NIAH	5/5 ✅	✅	✅

Key optimizations (86 automated experiments, 15 kept)

fp16 activation buffer — halves activation bandwidth (the bottleneck)
Shared-memory centroid LUT — eliminates constant memory serialization on divergent lane access (+89%)
Half2 arithmetic + strided block processing — 2× arithmetic density
Vectorized 128-bit loads — uint32×4 weights, int4 activations (+45%)
Register __byte_perm centroid decode — zero-memory centroid lookup
NWARPS 8→4

Also includes

Load-time q8_0 conversion now opt-in (was default)
Autoresearch harness with coherence testing (server API + factual Q&A gate)

Test plan

Speed: 240 t/s (+253% vs upstream, +36% vs q8_0 conversion)
PPL: 7.54 (matches q8_0 conversion quality)
KLD: 0.056 (lower than q8_0 conversion and q4_0)
NIAH: 5/5 (Kamradt varied filler)
Coherence: 4/4 (Paris, 4, print, Shakespeare)
Ampere (sm_86) testing
TQ3_1S regression check

…AM than q8_0 Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel. Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time conversion (240 vs 176 t/s) while using 1.7× LESS VRAM (4.5 vs 7.5 GiB). Key optimizations (found via 86 automated experiments): 1. fp16 activation buffer — halves activation bandwidth (the bottleneck) 2. Shared-memory centroid LUT — eliminates constant memory serialization on divergent lane access (+89% single change) 3. Half2 arithmetic + strided block processing — 2× arithmetic density 4. Vectorized 128-bit loads — uint32×4 weights, int4 activations (+45%) 5. Register __byte_perm centroid decode — zero-memory centroid lookup 6. NWARPS 8→4 Also: - Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead of default. Native kernel is strictly better on both speed and VRAM. - Autoresearch harness gains coherence testing (server API + factual Q&A) to catch silent corruption that PPL alone misses. Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S): Upstream V12 runtime: 67 t/s (4.5 GiB VRAM) q8_0 conversion: 176 t/s (7.5 GiB VRAM) Native optimized: 240 t/s (4.5 GiB VRAM) ← this PR Quality (vs f16 baseline): PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55) Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078) NIAH: 5/5 Coherence: 4/4 (Paris, 4, print, Shakespeare)

signalnine · 2026-04-06T15:22:18Z

TQ3_1S Regression Check — PASS

Metric	TQ3_1S	TQ4_1S
Speed	102 t/s	240 t/s
PPL	7.43	7.54
Coherence	3/3 ✅	4/4 ✅
Size	3.71 GiB	4.53 GiB

TQ3_1S uses the unchanged V8 scalar fallback path — no regression. The optimizations (shmem LUT, half2, __byte_perm, vectorized loads) only apply to the TQ4_1S dp4a kernel.

TheTom · 2026-04-06T17:30:42Z

Regression Test Results

Mac (M5 Max 128GB, Metal)

Build: clean, zero errors. Qwen2.5-1.5B Q4_K_M:

Config	pp512	tg128
Baseline (f16 KV)	10,095	276.3
q8_0/turbo4 KV	9,666	172.9

No regressions. Metal path unaffected by CUDA changes.

Windows (GTX 1080 Ti 11GB, CUDA 12.6, SM 6.1)

Build: clean, zero errors. Qwen2.5-7B-Instruct:

Config	pp512	tg128	Previous	Delta
Q8_0 baseline	1,145	36.4	950 / 36.1	+20% prefill
Q8_0 + turbo4 KV	1,128	36.3	— / 36.2	match
Config I (TQ4_1S)	587	51.0	956 / 38.4	+33% decode
Config I + turbo4 KV	583	51.4	1110 / 39.0	+32% decode

Config I decode improvement (38.4 → 51.0 tok/s) is the native TQ4_1S kernel optimization showing up on Pascal. Prefill regression on Config I (956 → 587) needs investigation — may be related to the load-time conversion change (now opt-in).

All 4 configs produce valid output. No crashes, no build warnings from the PR's files.

signalnine · 2026-04-06T17:42:38Z

Prefill regression confirmed — root cause identified

The prefill regression (956→587 on Pascal, 13K→6.5K on 5090) is from disabling the load-time q8_0 conversion. Native TQ4_1S prefill falls through to cuBLAS dequant-to-f16, which requires a full inverse WHT per element — much slower than the q8_0 cuBLAS path that only needs standard q8_0 dequant.

Path	PP512	TG128	VRAM
Native TQ4_1S	6,479	226	4.5 GiB
q8_0 conversion	13,312	176	7.5 GiB

The tradeoff is clear: native wins decode (+29%), q8_0 wins prefill (+105%). Need a hybrid that uses native for decode and q8_0 for prefill — or accept the prefill cost for the VRAM savings.

Options:

Default to conversion (old behavior) — best prefill, OK decode, more VRAM
Default to native (current PR) — best decode, poor prefill, less VRAM
Hybrid: convert at load time but also keep native kernel for decode — best of both but 1.7× VRAM (stores both q8_0 AND TQ4_1S... defeats the purpose)
User choice via env var — GGML_TQ_NATIVE=1 for decode-heavy workloads

For now I'll switch the default back to conversion (option 1) since prefill is more commonly the bottleneck, and document the native opt-in for decode-heavy use cases. The +33% decode on Pascal and +29% on 5090 is real and valuable for interactive chat with long context.

TheTom · 2026-04-06T17:46:40Z

Testing on my AMD now for posterity, Asking a community member to verify with his CUDA.

TheTom · 2026-04-06T17:54:15Z

Prefill regression confirmed — root cause identified

The prefill regression (956→587 on Pascal, 13K→6.5K on 5090) is from disabling the load-time q8_0 conversion. Native TQ4_1S prefill falls through to cuBLAS dequant-to-f16, which requires a full inverse WHT per element — much slower than the q8_0 cuBLAS path that only needs standard q8_0 dequant.

Path PP512 TG128 VRAM
Native TQ4_1S 6,479 226 4.5 GiB
q8_0 conversion 13,312 176 7.5 GiB
The tradeoff is clear: native wins decode (+29%), q8_0 wins prefill (+105%). Need a hybrid that uses native for decode and q8_0 for prefill — or accept the prefill cost for the VRAM savings.

Options:

Default to conversion (old behavior) — best prefill, OK decode, more VRAM

Default to native (current PR) — best decode, poor prefill, less VRAM

Hybrid: convert at load time but also keep native kernel for decode — best of both but 1.7× VRAM (stores both q8_0 AND TQ4_1S... defeats the purpose)

User choice via env var — GGML_TQ_NATIVE=1 for decode-heavy workloads

For now I'll switch the default back to conversion (option 1) since prefill is more commonly the bottleneck, and document the native opt-in for decode-heavy use cases. The +33% decode on Pascal and +29% on 5090 is real and valuable for interactive chat with long context.

good analysis on the prefill tradeoff. option 1 is the right call. defaulting to conversion keeps existing behavior, and documenting native opt-in for decode-heavy workloads gives users the choice without surprising anyone.

found a compile issue on AMD/ROCm will patch with HIP guards. wanted to flag it before merge.

+33% decode on pascal is a great result. the autoresearch harness is proving its worth.

Replace per-element generic dequant template (which repeats the full 32-element WHT butterfly 16 times per block) with a warp-cooperative version using __shfl_xor_sync. One WHT per block instead of 16. Note: this improves the dequant kernel itself but doesn't fix the prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs the q8_0 conversion path's native int8 tensor core GEMM. The dequant was never the slow part — the GEMM dispatch is fundamentally different. For prefill-heavy workloads, load-time q8_0 conversion remains the recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy interactive chat where the +29% decode speed matters more.

TheTom · 2026-04-06T18:09:34Z

heads up from a community tester — turbo-innerq.cuh has a build issue on static builds (BUILD_SHARED_LIBS=OFF). The TURBO_IQ_API macro assumes GGML_BACKEND_BUILD is defined, but it's only set for shared builds. Fix: wrap the block in #ifdef GGML_BACKEND_SHARED matching the pattern from ggml-backend.h. For static builds TURBO_IQ_API should be empty (no dllexport/dllimport).

signalnine · 2026-04-06T19:00:43Z

Static build fix + multi-token kernel update

Fixed the TURBO_IQ_API static build issue — wrapped in #ifdef GGML_BACKEND_SHARED matching the GGML_BACKEND_API pattern from ggml-backend.h. For BUILD_SHARED_LIBS=OFF, TURBO_IQ_API is now empty (no dllexport/dllimport).

Also added a multi-token TQ4_1S dp4a kernel for ne[1] ≤ 8 (speculative decoding, small batches). Loads weight data once per block, reuses across all tokens. For ne[1] > 8, native path now does runtime TQ4_1S→fp16 dequant + cuBLAS instead of the old per-element cuBLAS fallback.

Default remains load-time q8_0 conversion (option 1, as agreed). GGML_TQ_NATIVE=1 opts into native decode for VRAM savings.

Config	pp512	tg128	VRAM
Default (q8_0 conversion)	13,291	175	7.5 GiB
`GGML_TQ_NATIVE=1`	5,492	226	4.5 GiB

Re: HIP — happy to coordinate on the guards once your patch is ready.

TheTom · 2026-04-06T19:50:56Z

Community testing: dual RTX 4090 results

Tested on dual 4090s with layer split. Found a multi-GPU bug and fix.

Results

Metric	Native TQ4_1S	q8_0 Conversion	Delta
Prefill pp32768	2,318 t/s	3,513 t/s	q8_0 wins +51.5%
Decode tg128	38.55 t/s	32.05 t/s	Native wins +20.3%

Multi-GPU bug

Native TQ4_1S kernel crashes during decode on multi-GPU (dual 4090s). Prefill works fine (goes through cuBLAS mat_mul). Decode hits the native kernel which uses global static CUDA device pointers (static half * d_act_buf). cudaMalloc allocates on the current device, so:

First call on GPU 0 allocates buffers on GPU 0
Next call on GPU 1 tries to use GPU 0's buffers from GPU 1 — illegal memory access

q8_0 conversion path works for both prefill and decode because it uses standard q8_0 mul_mat_vec.

Fix

Replace static global buffers with per-device pool allocations from ctx.pool() via ggml_cuda_pool_alloc, matching the pattern in mmvq.cu. Changes in mmvq-tq.cu:

Remove static d_act_buf / d_q8_1_buf globals
Use ggml_cuda_pool_alloc for activation buffers (per-call, per-device)
Update kernel launch calls to use q8_1_buf.ptr / act_buf.ptr instead of static pointers

This makes the buffers per-device and per-call, fixing multi-GPU decode.

signalnine · 2026-04-06T20:53:02Z

Multi-GPU fix applied

Replaced static global d_act_buf / d_q8_1_buf with per-device ggml_cuda_pool_alloc from ctx.pool(id), matching the pattern in mmvq.cu. Buffers are now per-call and per-device — fixes the illegal memory access on dual GPU decode.

No performance regression on single GPU (pp512=5,474, tg128=225 with GGML_TQ_NATIVE=1).

- Multi-token dp4a kernel for ne[1]≤8 (speculative decoding, small batches) loads weight data once per block, reuses across all ncols_dst tokens - Runtime TQ4_1S→fp16 dequant + cuBLAS for ne[1]>8 prefill - Fix multi-GPU crash: replace static global CUDA buffers with per-device pool allocations from ctx.pool(id), matching mmvq.cu pattern - Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED

TheTom · 2026-04-06T21:04:58Z

AMD HIP/ROCm testing: RX 9070 XT (RDNA 4, gfx1201)

Tested by community member on Windows 11, HIP SDK 7.1, Qwen2.5-1.5B-Instruct.

Build fix required

__dp4a is NVIDIA-only. Needs to use the vendor-abstracted ggml_cuda_dp4a wrapper from common.cuh (8 replacements in mmvq-tq.cu). Diff:

-        const int s0 = __dp4a(c0, a0, 0);
+        const int s0 = ggml_cuda_dp4a(c0, a0, 0);

Regression results (Qwen2.5-1.5B, RX 9070 XT)

Config	Size	PPL	Prefill pp512	Decode tg128	vs Q8_0
Q8_0	1.76 GiB	11.119	6,917 t/s	167 t/s	baseline
Config I (TQ4_1S)	1.28 GiB	11.316	4,733 t/s	101 t/s	tg 60%

PPL: No regression (+1.8% vs Q8_0).

Regression vs previous kernel (PR #45 V12)

Kernel	Decode	Prefill	TG vs Q8_0
PR 45 (V12 float)	135 t/s	7,134 t/s	130%
PR 57 (dp4a int8)	101 t/s	4,733 t/s	60%

PR 57's dp4a path is optimized for NVIDIA tensor cores (Ampere+/Blackwell) where it achieves 240 t/s (+3.5x). On RDNA 4, the sudot4 intrinsic has different throughput characteristics and the int8 activation path adds conversion overhead, resulting in a regression vs the V12 float kernel.

May need an architecture-specific fallback to V12 for AMD GPUs, or an #ifdef to select the better path per vendor.

signalnine · 2026-04-06T22:08:51Z

Fixed — replaced all 8 __dp4a calls with ggml_cuda_dp4a from common.cuh. Pushed.

Re: RDNA4 regression — the dp4a/int8 path was designed around NVIDIA's dp4a throughput (Turing+). On RDNA4, sudot4 has different throughput characteristics and the q8_1 activation quantization adds overhead that the float path doesn't have. Two options:

Arch dispatch: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time, route AMD to the V12 float kernel (pre-rotate to half + shmem centroid LUT), keep dp4a for NVIDIA
Runtime heuristic: benchmark both paths on first call, cache the winner

Option 1 is simpler and probably sufficient. Happy to add the AMD fallback path if you want it in this PR, or we can split it out.

TheTom · 2026-04-07T13:51:03Z

RDNA4 arch dispatch fix — cherry-pick ready

Branch: fix/rdna4-arch-dispatch
Commit: 96e211d

What it does

Routes AMD GPUs to a scalar half-precision TQ4_1S kernel instead of the dp4a int8 path. NVIDIA continues using dp4a.

New kernel: mul_mat_tq4_1s_scalar_multi — pre-rotated half activations, shmem centroid LUT (16 entries), scalar dot product. Same pattern as TQ3_1S, no dp4a/__byte_perm.
Dispatch: use_dp4a = !GGML_CUDA_CC_IS_AMD(cc) && TQ4_1S. AMD falls through to the scalar path alongside TQ3_1S.
LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch to avoid duplicating the switch statements.

Expected RDNA4 result

Restore V12-level decode (~135 t/s, 130% of Q8_0) instead of dp4a regression (101 t/s, 60% of Q8_0).

Verified

Metal build: clean, zero errors (mmvq-tq.cu is CUDA-only, no Metal impact)
One file changed: ggml/src/ggml-cuda/mmvq-tq.cu (+98, -20)

Cherry-pick

git fetch https://github.com/TheTom/llama-cpp-turboquant.git fix/rdna4-arch-dispatch
git cherry-pick 96e211d

The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput (240 t/s on 5090). On RDNA4, sudot4 has different throughput characteristics and the q8_1 activation quantization adds overhead, causing a regression vs the V12 float kernel (101 vs 135 t/s on RX 9070 XT). Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD GPUs to a scalar half-precision kernel (same pattern as TQ3_1S). NVIDIA continues using the dp4a path. Changes: - Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations, shmem centroid LUT, scalar dot product (no dp4a/byte_perm) - Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path. - LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0) instead of dp4a regression (101 t/s, 60% of Q8_0). Co-Authored-By: Tom Turney <tturney1@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

signalnine · 2026-04-07T15:19:56Z

Cherry-picked 96e211d — builds clean, no NVIDIA regression (pp512=5,140, tg128=225). Pushed.

Clean approach with the LAUNCH_SCALAR macro. Looking forward to the RDNA4 retest numbers.

TheTom · 2026-04-07T15:26:44Z

Thanks I'll test it out on my local PC.

TheTom · 2026-04-07T23:09:55Z

RDNA4 retest — RX 9070 XT (gfx1201, HIP SDK 7.1) — ALL PASS

With the arch dispatch fix (commit 96e211d cherry-picked by @signalnine):

Speed (Qwen2.5-1.5B-Instruct, r=10)

Config	Size	Prefill pp512	Decode tg128	vs Q8_0
Base Q8_0	1.76 GiB	12,412	140	baseline
PR57 Q8_0	1.76 GiB	12,365	156	no regression
PR57 Config I (TQ4_1S)	1.28 GiB	8,922	166	+19%

Before vs after fix

Metric	dp4a path (before)	Scalar half path (after)
Config I decode	101 t/s (60% of Q8_0)	166 t/s (119% of Q8_0)

PPL (wikitext-2-raw, 512 context, 20 chunks)

Config	PPL
Q8_0	11.119
Config I	11.337 (+2.0%)

PPL: PASS — no regression.

Output quality

Prompt: "What is the capital of France? Answer briefly."
Output: "The capital of France is Paris."

PASS — coherent, correct.

Note for AMD testers

RDNA 4 has severe cold-start JIT variance. With r=3 or r=5, the first run drags the average down to 24-34 t/s. Always use r=10 on this GPU to get stable warm-state numbers.

seanrasch · 2026-04-08T00:10:25Z

Ampere (RTX 3080 Ti, SM 86) — native kernel parity with q8_0 convert

Tested this PR (pr/tq4-weight-opts @ 686220987) on RTX 3080 Ti. The 36% speedup you saw on RTX 5090 does not transfer to Ampere — native and q8_0 convert are within noise on SM 86.

Setup

Hardware: RTX 3080 Ti (SM 86, 12GB, ~760 GB/s bandwidth)
Model: Qwen2.5-7B-Instruct-TQ4_1S (quantized locally from bartowski f16)
Build: cmake -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON
Tool: llama-bench -p 512 -n 128 -r 3, single GPU via CUDA_VISIBLE_DEVICES=1

Results

Path	pp512 (t/s)	tg128 (t/s)
Native (this PR)	4704 ± 186	94.05 ± 0.15
q8_0 convert (`GGML_TQ_CONVERT_Q8=1`)	4664 ± 183	93.98 ± 0.21

Delta: +0.9% pp512, 0% tg128 — both within run-to-run variance.

Interpretation

On 5090 you saw native beat convert 240 → 176 t/s (+36%). On 3080 Ti they're identical. The PR's kernel optimizations (fp16 activation buffer, SMEM centroid LUT, half2 strided blocks, vectorized 128-bit loads, __byte_perm centroid decode, NWARPS 8→4) appear to be Ada/Blackwell-specific wins. On Ampere we're bottlenecked elsewhere — likely memory bandwidth for weight loading, or insufficient SMEM/L2 to realize the LUT gains.

Not a regression — native is at least as fast as convert on Ampere, and the lower-VRAM advantage still holds. Just not the headline 3.5x. The case for merging still stands on Ampere: equal speed, lower memory, simpler code path.

Cross-datapoint on the same build, Qwen3.5-9B TQ4_1S (48 layers, 8:1 GQA): convert actually edges native by ~5% pp512 (3776 vs 3592) — consistent with "native optimizations don't help on Ampere", and may hint that deeper models expose some overhead in the native path specifically on SM 86. Worth a look if you want to chase it.

Happy to run additional configs (ctx sweep, different models) if useful.

TheTom · 2026-04-08T00:42:05Z

@seanrasch thanks for the Ampere validation, that was the last unchecked box on the test plan. really appreciate the extra eyes.

good to confirm no regression on SM 86. the parity result makes sense — the kernel wins (shmem LUT, half2, vectorized loads) are latency optimizations that only show up when you're not bandwidth-bound, which Ampere is at these model sizes.

the 5% prefill gap on deeper models is a useful data point. lines up with what we'd expect from the per-element inverse WHT overhead in the native dequant path. default is conversion-on for prefill so it shouldn't bite anyone in practice.

thanks again for taking the time to test and write it up.

Splits the dequant+accumulate phase into two sub-loops: 1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup + scale multiply, reads from weight buffer only). 2. Read the rotated activation from shared memory ONCE per column, then FMA across all rows in a tight register loop. This is the Vulkan analogue of the 'hot loop load dedup' from the CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory read explicitly loop-invariant across rows, which helps compilers that don't auto-hoist LDS loads out of unrolled loops. Measured effect on Intel Arc A380 (Llama-3.2-3B premium, llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise but not a regression). The structure is cleaner regardless and should benefit architectures with higher LDS latency.

* vulkan: add TQ4_1S weight compression support Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight compression with 16 Lloyd-Max centroids, 32-element blocks). Shaders: - dequant_tq4_1s.comp: standalone dequant with WHT inverse via subgroupShuffleXor (32-thread workgroup, 5-stage butterfly) - mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline activation pre-rotation (forward RHT on activation, centroid*scale dequant without inverse RHT) - copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse - copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward RHT, dual half-block RMS scales, 16-centroid quantization - types.glsl: block_tq4_1s struct (d0, d1, qs[16]) - dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT) Pipeline wiring (ggml-vulkan.cpp): - MUL_MAT, SET_ROWS, CPY supports_op - pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32 - Specialized MUL_MAT_VEC with forced subgroup workgroup size Tests: - test_set_rows_tq4_1s: SET_ROWS round-trip validation * vulkan: add fused mul_mat_vec kernel for TQ4_1S Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the per-decode-step matrix-vector product no longer has to dequant the full weight tensor to f16 and then go through the generic matmul path. The kernel pre-rotates the activation via a forward Walsh-Hadamard Transform in shared memory and dot-products against the raw centroid*scale stored weights, folding the inverse-WHT on the weight side into the activation by the symmetry H = H^T. Math: w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k] sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j] Portability choices: - Workgroup size is pinned to 32 threads regardless of the DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for the current architecture. The butterfly operates on 32-element blocks with one element per thread; that contract is fixed by the quantization format, not by the GPU. Earlier revisions used `gl_WorkGroupSize.x` as the stride unit, which silently skipped half the work on Intel drivers that force the subgroup to 16 (tests passed via NMSE tolerance while real inference output was garbage). - Butterfly implementation is shared memory only. A subgroup-shuffle variant (`subgroupShuffleXor`) was prototyped and measured on Intel Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the explicit shared-memory butterfly, because Mesa emulates subgroup shuffles via LDS and ends up doing the same LDS traffic with extra driver overhead. The shared-memory butterfly is correct on every device regardless of subgroup-op support, is the fastest path on every device we can actually measure, and leaves the `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform across all DMMV_WG_SIZE buckets. - Reduction is the shared-memory tree reduction (no subgroupAdd), for the same reason: on Intel Arc the subgroupAdd is also LDS-backed and the hybrid reduction path was measurably slower. Future vendor-specific heuristics can switch to the hybrid or pure-subgroup reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops turn out to beat the LDS roundtrip there; the existing reduction modes in `mul_mat_vec_base.glsl` already provide the necessary variants. - NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows per workgroup. Each thread holds one position of each of the 8 weight blocks and pairs them with the shared rotated activation. - `mul_mm` and `flash_attn_cm2` shader generation is skipped for TQ4_1S because it is a weight-only format that never reaches the coopmat2 matmul or the KV cache flash-attention paths. Tests: - `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01 NMSE so real defects can't hide behind a loose check. - Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536, 2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test cases total. Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG, `llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512 --chunks 20 wiki.test.raw`): | Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 | |---------------|---------|----------:|----------:|-------:|---------:|---------:| | Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% | | Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% | | Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% | | Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% | Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I: the compressed model fits in less VRAM, and on a small model the TQ4_1S compute cost is offset by the reduced memory traffic. All four models produce coherent output end-to-end and the reductions line up with the TurboQuant paper's validation matrix (§5.8). The remaining gap to Q8_0 on the bigger models is compute-bound on the A380; it closes further on GPUs with more raw throughput. * vulkan: restructure TQ4_1S inner loop for cross-row smem reuse Splits the dequant+accumulate phase into two sub-loops: 1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup + scale multiply, reads from weight buffer only). 2. Read the rotated activation from shared memory ONCE per column, then FMA across all rows in a tight register loop. This is the Vulkan analogue of the 'hot loop load dedup' from the CUDA kernel (PR #57 optimisation #2). It makes the shared memory read explicitly loop-invariant across rows, which helps compilers that don't auto-hoist LDS loads out of unrolled loops. Measured effect on Intel Arc A380 (Llama-3.2-3B premium, llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise but not a regression). The structure is cleaner regardless and should benefit architectures with higher LDS latency.

* vulkan: add TQ4_1S weight compression support Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight compression with 16 Lloyd-Max centroids, 32-element blocks). Shaders: - dequant_tq4_1s.comp: standalone dequant with WHT inverse via subgroupShuffleXor (32-thread workgroup, 5-stage butterfly) - mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline activation pre-rotation (forward RHT on activation, centroid*scale dequant without inverse RHT) - copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse - copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward RHT, dual half-block RMS scales, 16-centroid quantization - types.glsl: block_tq4_1s struct (d0, d1, qs[16]) - dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT) Pipeline wiring (ggml-vulkan.cpp): - MUL_MAT, SET_ROWS, CPY supports_op - pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32 - Specialized MUL_MAT_VEC with forced subgroup workgroup size Tests: - test_set_rows_tq4_1s: SET_ROWS round-trip validation * vulkan: add fused mul_mat_vec kernel for TQ4_1S Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the per-decode-step matrix-vector product no longer has to dequant the full weight tensor to f16 and then go through the generic matmul path. The kernel pre-rotates the activation via a forward Walsh-Hadamard Transform in shared memory and dot-products against the raw centroid*scale stored weights, folding the inverse-WHT on the weight side into the activation by the symmetry H = H^T. Math: w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k] sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j] Portability choices: - Workgroup size is pinned to 32 threads regardless of the DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for the current architecture. The butterfly operates on 32-element blocks with one element per thread; that contract is fixed by the quantization format, not by the GPU. Earlier revisions used `gl_WorkGroupSize.x` as the stride unit, which silently skipped half the work on Intel drivers that force the subgroup to 16 (tests passed via NMSE tolerance while real inference output was garbage). - Butterfly implementation is shared memory only. A subgroup-shuffle variant (`subgroupShuffleXor`) was prototyped and measured on Intel Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the explicit shared-memory butterfly, because Mesa emulates subgroup shuffles via LDS and ends up doing the same LDS traffic with extra driver overhead. The shared-memory butterfly is correct on every device regardless of subgroup-op support, is the fastest path on every device we can actually measure, and leaves the `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform across all DMMV_WG_SIZE buckets. - Reduction is the shared-memory tree reduction (no subgroupAdd), for the same reason: on Intel Arc the subgroupAdd is also LDS-backed and the hybrid reduction path was measurably slower. Future vendor-specific heuristics can switch to the hybrid or pure-subgroup reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops turn out to beat the LDS roundtrip there; the existing reduction modes in `mul_mat_vec_base.glsl` already provide the necessary variants. - NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows per workgroup. Each thread holds one position of each of the 8 weight blocks and pairs them with the shared rotated activation. - `mul_mm` and `flash_attn_cm2` shader generation is skipped for TQ4_1S because it is a weight-only format that never reaches the coopmat2 matmul or the KV cache flash-attention paths. Tests: - `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01 NMSE so real defects can't hide behind a loose check. - Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536, 2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test cases total. Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG, `llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512 --chunks 20 wiki.test.raw`): | Model | Config | Size | Reduction | PPL Δ | pp512/Q8 | tg128/Q8 | |---------------|---------|----------:|----------:|-------:|---------:|---------:| | Qwen2.5-1.5B | I | 1570→1082 | -31.1% | +4.66% | 53.9% | 107.5% | | Phi-3.5-mini | I | 3873→2839 | -26.7% | +5.36% | 57.6% | 52.8% | | Llama-3.2-3B | hybrid | 3263→2147 | -34.2% | +2.03% | 82.4% | 84.2% | | Llama-3.2-3B | premium | 3263→2577 | -21.0% | +0.98% | 71.3% | 67.3% | Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I: the compressed model fits in less VRAM, and on a small model the TQ4_1S compute cost is offset by the reduced memory traffic. All four models produce coherent output end-to-end and the reductions line up with the TurboQuant paper's validation matrix (§5.8). The remaining gap to Q8_0 on the bigger models is compute-bound on the A380; it closes further on GPUs with more raw throughput. * vulkan: restructure TQ4_1S inner loop for cross-row smem reuse Splits the dequant+accumulate phase into two sub-loops: 1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup + scale multiply, reads from weight buffer only). 2. Read the rotated activation from shared memory ONCE per column, then FMA across all rows in a tight register loop. This is the Vulkan analogue of the 'hot loop load dedup' from the CUDA kernel (PR TheTom#57 optimisation TheTom#2). It makes the shared memory read explicitly loop-invariant across rows, which helps compilers that don't auto-hoist LDS loads out of unrolled loops. Measured effect on Intel Arc A380 (Llama-3.2-3B premium, llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise but not a regression). The structure is cleaner regardless and should benefit architectures with higher LDS latency.

Builds Mac Metal (arm64) and Windows CUDA (12.4) on tag push. Creates GitHub release with prebuilt binaries. vulkan: TQ4_1s support for model weights (#69) * vulkan: add TQ4_1S weight compression support Adds Vulkan shader support for TQ4_1S (4-bit WHT-rotated weight compression with 16 Lloyd-Max centroids, 32-element blocks). Shaders: - dequant_tq4_1s.comp: standalone dequant with WHT inverse via subgroupShuffleXor (32-thread workgroup, 5-stage butterfly) - mul_mat_vec_tq4_1s.comp: specialized MUL_MAT_VEC with inline activation pre-rotation (forward RHT on activation, centroid*scale dequant without inverse RHT) - copy_from_quant.comp: TQ4_1S dequant path with full WHT inverse - copy_to_quant.comp: TQ4_1S SET_ROWS quantization path with forward RHT, dual half-block RMS scales, 16-centroid quantization - types.glsl: block_tq4_1s struct (d0, d1, qs[16]) - dequant_funcs.glsl: TQ4_1S centroid*scale dequant (no RHT) Pipeline wiring (ggml-vulkan.cpp): - MUL_MAT, SET_ROWS, CPY supports_op - pipeline_dequant, pipeline_set_rows, pipeline_cpy_quant_f32 - Specialized MUL_MAT_VEC with forced subgroup workgroup size Tests: - test_set_rows_tq4_1s: SET_ROWS round-trip validation * vulkan: add fused mul_mat_vec kernel for TQ4_1S Adds a specialised MUL_MAT_VEC shader for TQ4_1S weights so the per-decode-step matrix-vector product no longer has to dequant the full weight tensor to f16 and then go through the generic matmul path. The kernel pre-rotates the activation via a forward Walsh-Hadamard Transform in shared memory and dot-products against the raw centroid*scale stored weights, folding the inverse-WHT on the weight side into the activation by the symmetry H = H^T. Math: w[k] = sign[k] * INV_SQRT32 * (H @ stored)[k] sum_k w[k] * a[k] = INV_SQRT32 * sum_j stored[j] * (H @ (sign * a))[j] Portability choices: - Workgroup size is pinned to 32 threads regardless of the DMMV_WG_SIZE bucket the rest of the mul_mat_vec family picks for the current architecture. The butterfly operates on 32-element blocks with one element per thread; that contract is fixed by the quantization format, not by the GPU. Earlier revisions used `gl_WorkGroupSize.x` as the stride unit, which silently skipped half the work on Intel drivers that force the subgroup to 16 (tests passed via NMSE tolerance while real inference output was garbage). - Butterfly implementation is shared memory only. A subgroup-shuffle variant (`subgroupShuffleXor`) was prototyped and measured on Intel Arc A380 with Mesa Xe HPG: it ran ~60-85 %% slower than the explicit shared-memory butterfly, because Mesa emulates subgroup shuffles via LDS and ends up doing the same LDS traffic with extra driver overhead. The shared-memory butterfly is correct on every device regardless of subgroup-op support, is the fastest path on every device we can actually measure, and leaves the `pipeline_dequant_mul_mat_vec_f32_f32[w][TQ4_1S]` slot uniform across all DMMV_WG_SIZE buckets. - Reduction is the shared-memory tree reduction (no subgroupAdd), for the same reason: on Intel Arc the subgroupAdd is also LDS-backed and the hybrid reduction path was measurably slower. Future vendor-specific heuristics can switch to the hybrid or pure-subgroup reduction variants on NVIDIA / AMD RDNA if hardware subgroup ops turn out to beat the LDS roundtrip there; the existing reduction modes in `mul_mat_vec_base.glsl` already provide the necessary variants. - NUM_ROWS is 8 so the butterfly cost amortises across 8 output rows per workgroup. Each thread holds one position of each of the 8 weight blocks and pairs them with the shared rotated activation. - `mul_mm` and `flash_attn_cm2` shader generation is skipped for TQ4_1S because it is a weight-only format that never reaches the coopmat2 matmul or the KV cache flash-attention paths. Tests: - `test-backend-ops` MUL_MAT tolerance tightened from 2.0 to 0.01 NMSE so real defects can't hide behind a loose check. - Added Gemma-4 E2B, Qwen, Phi and Llama dimensional coverage (k in {1536, 2048, 2304, 3072, 4096}, m in {256, 1152, 1536, 2048, 5120, 6144}, n in {1..8, 16, 64, 256}). 148 MUL_MAT test cases total. Verification (Intel Arc A380, 6 GB VRAM, Vulkan ANV / Mesa Xe HPG, `llama-bench -p 512 -n 128 -r 3` and `llama-perplexity -c 512 --chunks 20 wiki.test.raw`): | Model | Config | Size | Reduction | PPL ╬Ф | pp512/Q8 | tg128/Q8 | |---------------|---------|----------:|----------:|-------:|---------:|---------:| | Qwen2.5-1.5B | I | 1570тЖТ1082 | -31.1% | +4.66% | 53.9% | 107.5% | | Phi-3.5-mini | I | 3873тЖТ2839 | -26.7% | +5.36% | 57.6% | 52.8% | | Llama-3.2-3B | hybrid | 3263тЖТ2147 | -34.2% | +2.03% | 82.4% | 84.2% | | Llama-3.2-3B | premium | 3263тЖТ2577 | -21.0% | +0.98% | 71.3% | 67.3% | Qwen2.5-1.5B is faster than its own Q8_0 baseline with Config I: the compressed model fits in less VRAM, and on a small model the TQ4_1S compute cost is offset by the reduced memory traffic. All four models produce coherent output end-to-end and the reductions line up with the TurboQuant paper's validation matrix (┬з5.8). The remaining gap to Q8_0 on the bigger models is compute-bound on the A380; it closes further on GPUs with more raw throughput. * vulkan: restructure TQ4_1S inner loop for cross-row smem reuse Splits the dequant+accumulate phase into two sub-loops: 1. Pre-compute w_vals[n] for all NUM_ROWS rows (centroid lookup + scale multiply, reads from weight buffer only). 2. Read the rotated activation from shared memory ONCE per column, then FMA across all rows in a tight register loop. This is the Vulkan analogue of the 'hot loop load dedup' from the CUDA kernel (PR #57 optimisation #2). It makes the shared memory read explicitly loop-invariant across rows, which helps compilers that don't auto-hoist LDS loads out of unrolled loops. Measured effect on Intel Arc A380 (Llama-3.2-3B premium, llama-bench tg128, r=5): 15.50 -> 15.78 t/s (+1.8%, within noise but not a regression). The structure is cleaner regardless and should benefit architectures with higher LDS latency. fix: inverse WHT in test-turbo-quant.c round-trip (#59) fix(metal): disable TurboFlash by default тАФ corrupt output on Apple10 The TurboFlash two-pass fused attention kernel produces garbage output on M5 Max (Apple10/Metal4) for all turbo3 V configs. Disabling by default routes turbo3 through the standard FA path which works correctly. Users can opt-in with TURBO_FLASH=1 for testing/debugging. No perf regression тАФ standard FA path matches TurboFlash speed within noise (~55-57 t/s tg128 for q8_0/turbo3 on M5 Max). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: gate turbo V unpad on V type, not K type fix(hip): bypass pool for FA f16 temp buffers to prevent OOM The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM before f16 at equivalent context lengths on HIP/ROCm. Fix: on HIP, allocate f16 temp buffers with raw hipMalloc and free with hipFree (via RAII destructor) instead of the pool. Memory is released after the FA kernel completes via hipStreamSynchronize. Compared to commit 1 (VEC force), this approach recovers prefill: pp32768 q8_0/q8_0: 1137 t/s (VEC) -> 3060 t/s (bypass) = +169% pp32768 q8_0/turbo4: 1011 t/s (VEC) -> 3061 t/s (bypass) = +203% Decode: unchanged across all configs Trade-off: one hipStreamSynchronize per FA call (~5% overhead at 32k). Can be eliminated in future with hipFreeAsync (stream-ordered free). Root cause: ggml_cuda_pool_leg::free() stores buffers for reuse and never calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory. On HIP without VMM (gfx1100/gfx1200), pool permanently consumes peak VRAM. Confirmed hardware: gfx1201 (RX 9070 XT, 16GB, Windows 11, HIP 7.1) Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP) Refs: ggml-org/llama.cpp#20969 Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: force VEC FA path for quantized KV on HIP/ROCm The TILE/MMA/WMMA FA paths allocate unbounded f16 temp buffers proportional to full KV cache length for any quantized KV type. On ROCm/HIP these pool allocations persist at peak size, meaning the temp buffer VRAM exceeds the savings from KV compression. This causes quantized KV (q8_0, q4_0, turbo3, turbo4) to OOM before f16 at the same context length. Confirmed on gfx1100 (RX 7900 XT) and gfx1200 (RX 9060 XT). Also affects stock llama.cpp q4_0/q8_0 (not TurboQuant-specific). Fix: on HIP, force VEC path for quantized KV when available (head_dim <= 256). VEC does inline dequant with no temp buffer. Trade-off: prefill throughput may decrease (VEC processes queries sequentially). Decode is unaffected since VEC was already selected for single-token generation. Limitation: head_dim > 256 (e.g. Gemma 4 full_attention d=512) cannot use VEC and still routes through TILE. Bounded temp buffer in a separate compilation unit is the proper fix for those cases (see domvox/llama.cpp-turboquant-hip a13c3db12 discussion). Refs: ggml-org/llama.cpp#20969 (community reports) Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> vulkan: fix turbo3 build + coopmat FA after April upstream sync Origin's April upstream-sync rebase interleaved two changes that left the Vulkan turbo3 KV path broken: * ggml-org/llama.cpp upstream PR #21572 (1f30ac0ce) moved fp16 RTE rounding to a runtime SPIR-V patch and dropped the _rte shader variants plus rte.glsl itself. * TheTom/llama-cpp-turboquant PR #62 (ff8bb7394) added turbo3 KV support against a base that still had those variants. After the rebase, the tree had dangling cpy_f32_*_rte_len / _data references, a two-arg SET_ROWS macro called with one arg, a MMQ shader variants generated for turbo3_0 even though the flash_attn MMQ path has no turbo3 code. The result was that ggml-vulkan.cpp failed to compile on a clean checkout (spirv-headers + all of the above) and the shader-gen emitted garbage variants. Separately, turbo3 flash-attn pipelines were only wired up for FA_SCALAR. On a coopmat-capable device (e.g. RADV on a 7900 XTX) the tuning heuristic picks FA_COOPMAT1 for most shapes, which landed in ggml_vk_flash_attn with an uninitialized pipeline (wg_denoms={0,0,0}) and tripped the Br == wg_denoms[0] assertion as soon as a prefill ubatch was dispatched. End-to-end llama-cli on Vulkan + -ctk turbo3 aborted on the first real forward pass. Changes: * Drop the if (float_controls_rte_fp16) / else branches around cpy_f32_quant pipeline creation and collapse SET_ROWS to a single variant, matching upstream post-1f30ac0ce. * Remove the #include "rte.glsl" from copy_to_quant.comp. * Skip the MMQ flash_attn shader variant for turbo3_0 in the shader generator (no MMQ code path for it). * Register CREATE_FA(GGML_TYPE_TURBO3_0, turbo3_0, FA_COOPMAT1, _cm1) and the _cm2 counterpart alongside the other quant types. Verified on AMD 7900 XTX (gfx1100 / RADV NAVI31, ROCm 7.2.1 + Vulkan 1.4.341, spirv-headers 1.4.341.0): * Full HIP+Vulkan build is clean with no shader compile errors. * test-backend-ops -o SET_ROWS -b Vulkan0 : 147/147 * test-backend-ops -o FLASH_ATTN_EXT -b Vulkan0 -p type_KV=turbo3 : 530 cases pass (previously aborted on case 3). * test-backend-ops -o FLASH_ATTN_EXT -b ROCm0 -p type_KV=turbo3 : still green (no HIP regression). * llama-cli on Qwen3-8B Q4_K_M with -ngl 99 -fa on -ctk turbo3 -ctv turbo3 on Vulkan0 no longer aborts. The remaining head_dim=128 correctness issue on the Vulkan turbo3 decode path is pre-existing and orthogonal to this change. llama-bench on Qwen3.5-27B Q4_K_M, 7900 XTX OC, HIP backend: F16 tg128=20.98 turbo3 tg128=20.13 turbo4 tg128=20.17 Refs: TheTom/llama-cpp-turboquant issues #50, #64, #81 fix(metal): add turbo2/3/4 types to FLASH_ATTN_EXT and CPY support checks fix nix build: Add spirv-headers to vulkanBuildInputs ref: https://github.com/TheTom/llama-cpp-turboquant/issues/81 spirv-headers were introduced by https://github.com/TheTom/llama-cpp-turboquant/commit/1f30ac0ceac0e2b4400069d81857089b6e04872a but not added to the nix build environment fix(cuda): add F16-K + TURBO-V dispatch cases in fattn.cu The flash attention vector dispatcher was missing instance files, extern declarations, and dispatch cases for F16-K + TURBO-V combinations. Any of `-ctk f16 -ctv turbo{2,3,4}` aborted at fattn.cu:339 on sm_89 and sm_121 because no dispatch case matched. The kernel already treats TURBO types as unquantized alongside F16 and BF16 (fattn-vec.cuh:79-80), so this is a dispatch-plumbing gap, not a kernel limitation. Pure additive change. CMakeLists.txt is updated so builds without GGML_CUDA_FA_ALL_QUANTS (the default) also link the new template instances. Validated on: - RTX 4090 (sm_89) with Qwen3-30B-A3B Q4_K_M, FA_ALL_QUANTS=ON - DGX Spark GB10 (sm_121) with Q4_K_M and UD-Q4_K_XL, both FA_ALL_QUANTS=ON and =OFF Key results: - Pre-fix crash reproduced at fattn.cu:339 (exit 134 SIGABRT, sm_89) - Post-fix: all three new combos (f16/turbo{2,3,4}) run in llama-bench on sm_89 - sm_89 tg128: 215-218 t/s for new combos (vs 242 t/s for f16/f16 baseline) - sm_121 tg128: 83.78-85.11 t/s for f16/turbo3 across model and build-config combinations - Single-run PPL on sm_89 wikitext-2: 7.5281 vs 7.4950 f16/f16 baseline (delta 0.44%, within within-run error) - KV context memory at 4K ctx on sm_89: 229 MiB vs 384 MiB f16/f16 baseline Fixes #83 fix(cuda): allow f16/bf16 + q8_0 mixed KV without GGML_CUDA_FA_ALL_QUANTS (#82) The FA dispatcher rejected any K != V type combo unless all types were in the turbo+q8_0 set. This meant common configs like `-ctk f16 -ctv q8_0` fell back to CPU unless built with -DGGML_CUDA_FA_ALL_QUANTS=ON. The vec template instances for f16/bf16 + q8_0 are already compiled (fattn-vec-instance-{f16,bf16}-q8_0.cu and their reverse), so the dispatcher was gating kernels that do exist. Extend the predicate to include f16 and bf16 alongside turbo + q8_0. Reported by @dentity007 on sm_89 (RTX 4090) and sm_121 (GB10), where `-ctk f16 -ctv q8_0` showed 340x slowdown indicative of CPU fallback. Co-Authored-By: tturney@psyguard.ai Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: add TURBO2_0 to flash_attn auto-enable check turbo2 V cache failed with "failed to create context" because the auto-enable predicate only listed turbo3/turbo4. Without auto-enable, the subsequent quantized-V-requires-FA check hard-fails. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> ci: fix turbo build and test failures Four changes that together make the TheTom/llama-cpp-turboquant CI green on top of feature/turboquant-kv-cache: 1. ggml-cpu/ops.cpp:ggml_compute_forward_clamp The abort branch's -Werror=switch was missing the three TurboQuant KV cache types (TURBO2_0/TURBO3_0/TURBO4_0), which broke every GCC build after the turbo types were added. Added them next to the TQ* cases so the fall-through to GGML_ABORT stays consistent. 2. ggml-rpc.h:RPC_PROTO_PATCH_VERSION / static_assert GGML_OP_TURBO_WHT bumped GGML_OP_COUNT from 96 to 97 but the RPC static_assert was not updated. Bumped the patch version 1->2 and the assert to 97 so ubuntu-latest-rpc / windows-latest-hip / etc. build. 3. tests/test-quantize-fns.cpp a) Heap-buffer-overflow (ASan trace): the scratch buffers std::vector<uint8_t> tmp_q{,1,2}(2*test_size) are too small when vec_dot_type == GGML_TYPE_F32 (which is the case for turbo quants), because from_float then writes test_size*sizeof(float) bytes into them. Size buffers to std::max(2*test_size, test_size*sizeof(float)) so every quant's from_float fits. b) Skip TURBO2_0 / TURBO3_0 / TURBO4_0 in the harness. Their dequantize_row_turbo*_0 intentionally stays in the WHT-rotated domain; the inverse WHT is applied separately via GGML_OP_TURBO_WHT in the attention graph, so a round-trip against the original float buffer is not meaningful and the existing threshold table does not apply. c) Add TQ3_1S to the 3-bit threshold buckets (absolute error uses MAX_QUANTIZATION_TOTAL_ERROR_3BITS, dot product uses MAX_DOT_PRODUCT_ERROR_LOWBIT) so the 3-bit weight quant is held to 3-bit error budgets instead of the default 4-bit ones. 4. ggml-cuda/turbo-quant.cuh Wrap all cudaMemcpyToSymbol / cudaMemcpyFromSymbol / cudaDeviceSynchronize calls in CUDA_CHECK(...) so the HIP build (hipError_t is [[nodiscard]]) no longer trips -Werror on unused-result. Verified locally on Arch Linux (CPU build): ctest -L main -E "test-llama-archs|test-tokenizers-ggml-vocabs" => 100% tests passed, 0 tests failed out of 41 (the excluded tokenizer test needs the model-fetch fixture) test-quantize-fns runs clean under AddressSanitizer as well. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Fix GGML_OP_COUNT assertion for RPC When compiling with the `-DGGML_RPC=ON` option, a static assert fails for `GGML_OP_COUNT == 96`. It should be 97. Fix memory explosion on Apple Silicon Add GitHub Sponsors funding link vulkan: add turbo3 backend tests - test_turbo_wht: forward/inverse WHT, 18 configs. NMSE tolerance 1e-5 (f32 SIMD reduction order varies across GPU backends). - test_turbo_wht_roundtrip: forward then inverse recovers original, 9 configs. NMSE tolerance 1e-5. - test_set_rows_turbo3: full quantization round-trip at small and large tensor sizes. Large tensors exercise the 2D dispatch grid. 21 configs. - Existing: test_turbo_wht (18), FA with turbo3 KV (528). - Total: 576 tests. vulkan: fix and complete turbo3 KV cache support feat: add CDNA4 (gfx950/MI355X) support + test results Add AMD Instinct MI355X (gfx950) architecture support: Code changes: - vendors/hip.h: Add CDNA4 define for __gfx950__, include in CDNA family - common.cuh: Add GGML_CUDA_CC_CDNA4 constant and IS_CDNA4 macro - mma.cuh: Route CDNA4 to compatible MFMA instructions * bf16: mfma_f32_16x16x16bf16_1k (same as CDNA3) * int8: mfma_i32_16x16x32_i8 (same as CDNA3) * f32: mfma_f32_16x16x4f32 (CDNA2 path, NOT xf32 which doesn't exist on gfx950) - mmq.cuh: Include CDNA4 in stream-k dispatch - common.cuh: Exclude CDNA4 from CDNA3-specific e4m3_fnuz FP8 path (gfx950 uses standard e4m3fn) MI355X test results (Qwen2.5-1.5B Q4_K_M, single GPU): - turbo3: 39,140 tok/s prefill (98% of f16), 162 tok/s decode (64%) - turbo4: 39,232 tok/s prefill (98% of f16), 214 tok/s decode (84%) - WHT roundtrip: PASS (max error 2.98e-07) Note: non-FA MMQ path crashes on gfx950 (xf32 MFMA unsupported). TurboQuant types force FA and work correctly. Tested-by: Andy Luo <andyluo7@users.noreply.github.com> docs: add AMD Instinct MI300X (gfx942) ROCm test results TurboQuant KV cache compression (turbo2/turbo3/turbo4) builds and runs correctly on AMD Instinct MI300X with ROCm 7.0.2. Zero code changes required тАФ existing CUDA kernels compile via HIP translation. Test results (Qwen2.5-1.5B Q4_K_M, single MI300X): - WHT roundtrip: PASS (max error 2.98e-07) - turbo3 prefill: +3% vs f16 (25,200 vs 24,453 tok/s) - turbo3 decode: 88% of f16 (160 vs 181 tok/s) - turbo4 prefill: +4% vs f16 (25,427 vs 24,453 tok/s) - turbo4 decode: 89% of f16 (161 vs 181 tok/s) MI355X (gfx950) compiles but needs gfx950 added to llama.cpp's MMQ kernel dispatch (upstream issue, not TurboQuant-specific). Tested-by: Andy Luo <andyluo7@users.noreply.github.com> metal: add TurboFlash attention kernel for turbo3 KV cache decode Two-pass block-parallel attention kernel optimized for turbo3 V cache decode on Apple Silicon Metal. Supports both q8_0-K (asymmetric) and turbo3-K (symmetric) configurations via compile-time function constant. Architecture: - Pass 1: 32-thread SIMD group per (query-head, block) pair - Each lane handles DK/32 interleaved dimensions - Q loaded to per-lane registers, K dequant via q8_0 or turbo3 path - K scoring via simd_sum dot product - turbo3 V unpack with register codebook (8 centroids) - Online softmax (m/l/o state) entirely in registers - Zero shared memory in pass 1 - Pass 2: merge partial results across blocks - Online softmax correction with global max/sum - Inverse WHT via simd_shuffle_xor (stages 0-4) + shared memory (stages 5-6) - Eliminates 5 of 7 threadgroup barriers vs naive butterfly Auto-detection: activates for single-token decode (ne01==1) when V is turbo3 and K is q8_0 or turbo3. Controllable via TURBO_FLASH env var (0=off, 1=force). Block size B=64 (proven optimal on Apple Silicon). Benchmarks (Qwen2.5-7B Q8_0, asymmetric q8_0-K/turbo3-V): - M5 Max 128GB: +1.5% decode at 8K (56.82 vs 56.00 tok/s), 93% of q8_0 - M2 Pro 32GB: +0.6% decode at 8K (20.55 vs 20.42 tok/s) - Advantage scales with context (+7.3% at 32K) Inspired by Eric Kryski's TurboFlash architecture (mlx-swift-lm). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: Vulkan compute shader support for turbo3 KV cache Full turbo3 quantize/dequant pipeline for Vulkan backend: - types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4]) - dequant_turbo3_0.comp: standalone dequant shader (3-bit index reconstruction from 2-bit qs + 1-bit signs, centroid lookup) - dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths - dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support - copy_to_quant.comp: quantize function with norm correction - vulkan-shaders-gen.cpp: turbo3_0 type registration - ggml-vulkan.cpp: pipeline creation and supports_op dispatch Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: AMD/RDNA4 arch dispatch тАФ scalar half path for TQ4_1S on AMD GPUs The dp4a int8 kernel is optimized for NVIDIA Turing+ dp4a throughput (240 t/s on 5090). On RDNA4, sudot4 has different throughput characteristics and the q8_1 activation quantization adds overhead, causing a regression vs the V12 float kernel (101 vs 135 t/s on RX 9070 XT). Fix: check GGML_CUDA_CC_IS_AMD(cc) at dispatch time and route AMD GPUs to a scalar half-precision kernel (same pattern as TQ3_1S). NVIDIA continues using the dp4a path. Changes: - Add mul_mat_tq4_1s_scalar_multi kernel: pre-rotated half activations, shmem centroid LUT, scalar dot product (no dp4a/byte_perm) - Dispatch: use_dp4a = !AMD && TQ4_1S. AMD falls through to scalar path. - LAUNCH_SCALAR macro unifies TQ4_1S/TQ3_1S scalar dispatch Expected RDNA4 result: restore V12-level decode (135 t/s, 130% of Q8_0) instead of dp4a regression (101 t/s, 60% of Q8_0). Co-Authored-By: Tom Turney <tturney1@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility feat: multi-token TQ4_1S dp4a kernel + multi-GPU fix + static build fix - Multi-token dp4a kernel for ne[1]тЙд8 (speculative decoding, small batches) loads weight data once per block, reuses across all ncols_dst tokens - Runtime TQ4_1SтЖТfp16 dequant + cuBLAS for ne[1]>8 prefill - Fix multi-GPU crash: replace static global CUDA buffers with per-device pool allocations from ctx.pool(id), matching mmvq.cu pattern - Fix static build: TURBO_IQ_API wrapped in #ifdef GGML_BACKEND_SHARED perf: warp-cooperative TQ4_1S dequant (16├Ч less compute per block) Replace per-element generic dequant template (which repeats the full 32-element WHT butterfly 16 times per block) with a warp-cooperative version using __shfl_xor_sync. One WHT per block instead of 16. Note: this improves the dequant kernel itself but doesn't fix the prefill gap (5.9K vs 13.3K). The bottleneck is cuBLAS fp32 GEMM vs the q8_0 conversion path's native int8 tensor core GEMM. The dequant was never the slow part тАФ the GEMM dispatch is fundamentally different. For prefill-heavy workloads, load-time q8_0 conversion remains the recommended path (default ON). GGML_TQ_NATIVE=1 for decode-heavy interactive chat where the +29% decode speed matters more. perf: TQ4_1S native kernel 3.5├Ч faster тАФ 240 t/s (was 68), smaller VRAM than q8_0 Autoresearch-discovered optimizations for TQ4_1S weight mul_mat_vec kernel. Native TQ4_1S at 5.0 bpv now runs 36% FASTER than the q8_0 load-time conversion (240 vs 176 t/s) while using 1.7├Ч LESS VRAM (4.5 vs 7.5 GiB). Key optimizations (found via 86 automated experiments): 1. fp16 activation buffer тАФ halves activation bandwidth (the bottleneck) 2. Shared-memory centroid LUT тАФ eliminates constant memory serialization on divergent lane access (+89% single change) 3. Half2 arithmetic + strided block processing тАФ 2├Ч arithmetic density 4. Vectorized 128-bit loads тАФ uint32├Ч4 weights, int4 activations (+45%) 5. Register __byte_perm centroid decode тАФ zero-memory centroid lookup 6. NWARPS 8тЖТ4 Also: - Load-time q8_0 conversion now opt-in (GGML_TQ_CONVERT_Q8=1) instead of default. Native kernel is strictly better on both speed and VRAM. - Autoresearch harness gains coherence testing (server API + factual Q&A) to catch silent corruption that PPL alone misses. Benchmarks (RTX 5090, Qwen2.5-7B-Instruct TQ4_1S): Upstream V12 runtime: 67 t/s (4.5 GiB VRAM) q8_0 conversion: 176 t/s (7.5 GiB VRAM) Native optimized: 240 t/s (4.5 GiB VRAM) тЖР this PR Quality (vs f16 baseline): PPL: 7.54 (f16: 7.18, q8_0 conv: 7.55) Mean KLD: 0.056 (q8_0 conv: 0.057, q4_0: 0.078) NIAH: 5/5 Coherence: 4/4 (Paris, 4, print, Shakespeare) fix: cap map0 kernel shmem for 256-expert MoE models Graph reservation passes worst-case ne20=ne02 (256x256x2=128KB), exceeding the 32KB threadgroup memory limit on Apple Silicon. At runtime ne20 is the actual n_expert_used (e.g. 8), so shmem = 256*8*2 = 4KB, well within limits. Cap the reservation shmem to 32KB to prevent the assert from firing. Tested on Qwen3.5-35B-A3B (256 experts) with llama-server + flash attention тАФ previously crashed during warmup, now runs at 22 t/s. Fixes #58 Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: add MoE expert count kernel instantiations + TQ4_1S backend tests Metal MoE support: - Add kernel_mul_mm_id_map0 instantiations for ne20 = 32, 60, 64, 128, 160, 256 - Covers Yuan, Qwen1.5-MoE, OLMoE, Qwen2/3-MoE, Mistral Small 4, Llama 4 Maverick, DeepSeek-V2/V3, Qwen3.5-35B/122B - Note: ne02=256 (Qwen3.5-35B-A3B) hits shmem assert in llama-server with flash attention тАФ needs chunked map0 dispatch (follow-up) Backend tests: - Add TQ3_1S and TQ4_1S to all_types array in test-backend-ops - Enables GET_ROWS and MUL_MAT coverage for WHT-rotated weight types Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: GCC double extern in ops.cpp turbo3_cpu_wht_group_size extern "C" GGML_API creates double extern on paths where GGML_API expands to 'extern'. Wrap in extern "C" {} block instead. Reported by Madreag on RTX 5090 WSL2. Co-Authored-By: Tom Turney <tturney1@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Update GGMLQuantizationType and LlamaFileType enums to include TQ3_1S and TQ4_1S quantization types with corresponding sizes in GGML_QUANT_SIZES. Enhance Metal operations for TQ weights and concurrency handling for Gemma 4 - Introduced a new function to check if TQ source tensors mutate during operations. - Updated matrix multiplication logic to handle TQ weights more effectively, ensuring correct concurrency behavior. - Adjusted Metal kernel definitions to support TQ weights with improved dispatch parameters. - Enhanced comments for clarity on concurrency issues related to TQ weights. fix: remove redundant extern from GGML_API macro (GCC 13.3 hard error) GCC 13.3 on Ubuntu 24.04 hard errors on the redundant `extern` in the GGML_API macro when used inside `extern "C" {}` blocks. MSVC and Clang accept it silently. Reported independently by joemc1470 (RX 6600 HIP) and christopheraleman1015 (WSL2 Ubuntu). Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: load-time TQ4_1S -> q8_0 conversion for CUDA dp4a speed Cherry-picked from signalnine (daf0484, dca057a). TQ4_1S tensors are converted to q8_0 at model load via fused CUDA kernel. Transparent to downstream code. 40% smaller on disk, TQ4_1S quality, q8_0 decode speed. signalnine's results (RTX 5090): 105 t/s converted vs 103 t/s native q8_0. PPL 7.608 vs 7.599 (0.009 requantization rounding). Also includes group_size=32 for ggml_turbo_wht (needed for TQ weight types). Metal regression: clean (tested Q8_0, Config I, turbo3 KV on M5 Max). Co-Authored-By: Gabe Ortiz <gabe@signalnine.net> Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Fix turbo4 C reference WHT dequant mismatch (#43) Fix/turbo4 wht dequant fix: CPU vec_dot heap allocation for turbo/TQ types (n > 4096 models) Stack-allocated float tmp[4096] buffers in CPU vec_dot functions crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632, Qwen 27B 18944). Replaced with heap allocation. Affects CPU-only inference fallback path. GPU users unaffected. Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced CPU fallback. Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944. No crash, no assert. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: add dk512 Metal FA kernel instances for turbo types (Gemma 4 support) Gemma 4 uses global_head_dim=512 for full attention layers. The turbo FA kernels were only instantiated up to dk256 for symmetric and cross-turbo combos. Missing dk512_dv512 caused pipeline compilation failure on Gemma 4 (and any future model with head_dim=512 + turbo KV). Added 18 template instantiations (9 non-vec + 9 vec) for all turbo type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already had dk512 and were not affected. Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric q8_0/turbo4 both produce correct bench results at dk512. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: AMD HIP/ROCm build support for TQ4_1S weight compression - Add missing CUDA->HIP stream capture API mappings (vendors/hip.h) - Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux) - Add fileno/isatty POSIX compat macros for clang on Windows No kernel changes needed. signalnine's fused mmvq-tq kernel uses __shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4. Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I. Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL. Metal regression: clean, no changes to non-Windows/non-HIP paths. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: Windows MSVC build compatibility for TQ weight types - Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI) - Add GGML_API to turbo3_cpu_wht_group_size for DLL export - Move extern declaration to file scope with extern "C" GGML_API linkage to fix C vs C++ name mangling across DLL boundary All changes are no-ops on Linux/Mac. Fixes MSVC build errors: C2065: 'M_PI': undeclared identifier LNK2001: unresolved external symbol turbo3_cpu_wht_group_size Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> perf: V12 single-phase fused TQ mmvq тАФ shmem activation, no global scratch Replace two-phase approach (separate pre-rotation kernel + global memory round-trip) with single-phase kernel where all 8 warps cooperatively WHT-rotate activation into shared memory, then each warp processes one row reading from shmem (broadcast reads from L1). Key changes: - No global scratch buffer (eliminates CUDA graph incompatibility) - No separate kernel launch for pre-rotation - Activation stays in shmem (~14-32 KB) instead of global memory - Single __syncthreads between rotation and dot product (NOT in inner loop) - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit) This avoids the NR0 regression that killed V3/V6/V11 тАФ those had sync inside the dot product loop. V12's sync is between the two phases. Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates 2x activation bandwidth). Pascal improvement smaller (still bandwidth bound). NEEDS TESTING тАФ apply and benchmark on Ampere/Ada/Blackwell. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: TQ4_1S on MoE models тАФ disable CUDA graphs for TQ MUL_MAT_ID MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the expert routing dispatch. This is incompatible with CUDA graph capture. TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching the existing behavior for non-quantized types. Also: use persistent buffer for activation pre-rotation scratch, with graph-capture-safe check. Tested: Qwen3.5-35B TQ4_1S тАФ 47 t/s decode, PPL 6.42. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> perf: fused TQ4_1S/TQ3_1S mul_mat_vec тАФ 3.4x decode speedup Two-phase approach: pre-rotate activation once via warp shuffle WHT, then simple mmvq kernel reads pre-rotated values (centroid ├Ч scale only, zero WHT per block). Results (Qwen2.5-7B TQ4_1S, RTX 5090): Decode: 20.3 тЖТ 69 t/s (3.4x speedup) vs q8_0 (177 t/s): 39% PPL: 8.82 (identical) Comprehensive optimization log (13 versions tested): cuBLAS baseline: 20 t/s V1 per-warp WHT, 4 warps: 60 t/s (3.0x) V3 shmem activation cache: 33 t/s (syncthreads kills it) V5 multi-warp per row: 62 t/s V6 LUT (shmem): 37 t/s (sync overhead) V7 8 warps clean: 62 t/s V8 pre-rotation (2-phase): 69 t/s тЖР BEST (3.4x) V9 pre-rot + q8_1: 70 t/s (marginal) V10 4-elem/thread: 57 t/s V13 8-elem, 4-thread dot: 45 t/s NR0=2/4/8: all regressed (register spill / cache thrash) The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed int32 processing тАФ TQ4_1S requires float centroid lookup which can't use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from Apple Silicon's cooperative SIMD efficiency. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: TQ4_1S CUDA тАФ mmvq exclusion for fused path + quantize tool registration - ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_ mul_mat_vec_q (was missing, causing ABORT in mmvq.cu) - tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types Tested: Qwen2.5-7B TQ4_1S тАФ correct output, PPL 8.82 (+1.1% vs Q2_K), 6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: CUDA port of TQ4_1S/TQ3_1S weight dequant (signalnine) fix: disable upstream attn rotation by default (conflicts with TurboQuant) Upstream commit 744c0c7 added graph-level Hadamard rotation for KV cache quantization. This conflicts with our kernel-level WHT rotation and causes graph hash table overflow on Phi-4 and potentially other models. Disable by default since TurboQuant already handles rotation at the kernel level (more efficient, no extra graph nodes). Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: add post-unrotate memory barrier for in-layer mixing safety Without this barrier, the GPU may start executing the next node's matmul while the unrotate kernel is still modifying src1. This causes data corruption when TQ and non-TQ tensors are mixed within the same layer's attention block. The barrier ensures the unrotate completes before any subsequent operation reads from the same activation buffer. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels Add WHT-rotated weight quantization types: - TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW - TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW Both use 32-element Randomized Hadamard Transform with dual half-block scales (d0/d1). Quantization: forward RHT тЖТ scale search тЖТ iterative refinement (6 iter) тЖТ pack indices. Metal optimization (V2.1 fused kernel): - Zero threadgroup memory for rotation (was 20KB+ on large models) - Cooperative SIMD rotation via simd_shuffle_xor (registers only) - Single simd_sum at end (not per-block) - NR0=8 rows per threadgroup (amortizes rotation cost) - Memory barriers between rotate/matmul/unrotate dispatches - MoE MUL_MAT_ID support with rotated expert dispatch Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2 Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE: - 27-41% model size reduction - +1.3-1.9% PPL (Qwen), 94-102% decode speed - NIAH pass, KLD comparable to turbo3 KV - Llama 3.1 70B shows +25% PPL тАФ needs investigation Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Remove unused CENTROIDS_1BIT constant Leftover from 1-bit VX experiment. Causes -Werror build failure in CI. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: add missing TurboQuant FA template instances for HIP/ROCm build The HIP build was missing 9 turbo cross-type flash attention vec instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists. Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP since those instance files are already excluded from the HIP build (they exceed HIP's 65536-byte local memory limit). Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo) Enable Sparse V on all Metal, auto-enable Boundary V for turbo2-V Sparse V: now enabled by default on all Metal (was M5+ only). Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0. Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set. Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V. 37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: CUDA warp-to-block mapping for block_size=128 (turbo3, turbo2) The block_size=128 change (adac2c68d) broke CUDA quantization: with QK=128, blocks_per_group=1, but the warp-cooperative packing still used blk_base+warp_id, causing warps 1-3 to write OOB. Fix: compute elem_in_block = j % QK_TURBO_N and use it for block pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs, elem_in_block / 8 for signs). Works for both QK=32 and QK=128. Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3: PPL = 7.587 (matches QK=32 baseline exactly). Increase turbo3/turbo2 block size from 32 to 128 One norm per rotation group instead of four identical copies. Eliminates 6 bytes of redundant storage per 128-element group. turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression Zero PPL regression validated across: - Asymmetric q8_0-K + turbo{2,3}-V - Symmetric turbo3/turbo3 - Boundary V (LA-V7) - 3 architectures (dense, Qwen, MoE) - 3 context lengths (512, 8K, 32K) - 2 Apple Silicon platforms (M5 Max, M2 Pro) - NIAH 3/3 pass +3-7% decode on tested M2 Pro setup. No regression on M5. Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250 hardcoded FA template nl values. Block size is now a one-line edit in ggml-common.h. Credit to @AmesianX whose block_size=256 CUDA implementation prompted this investigation. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: HIP/ROCm porting for TheTom's turbo3/turbo2 warp-cooperative kernels Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3 flash attention templates to HIP/ROCm (7900 XTX, gfx1100). HIP vendor header fixes: - Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol - Add cudaMemcpyHostToDevice/DeviceToHost mappings - Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support both 3-arg and 4-arg calls (CUDA allows defaulting width to warpSize, HIP macros required 4 args) - Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit on wave64 platforms, turbo code expects 32-bit) HIP CMakeLists: - Add turbo3 and turbo2 flash attention template instances (same files as CUDA CMakeLists, were missing from HIP build) Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16) Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug (fixed by TheTom in 53f1298bd). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: KV state serialization uses padded tensor width (issue #28 follow-up) state_write_data and state_read_data used hparams.n_embd_k_gqa (576) for ggml_row_size, but turbo types zero-pad to 640. For turbo4 (QK=128), 576 % 128 != 0 тЖТ ggml_row_size assertion failure during prompt cache save on llama-server slot reuse. Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of hparams values in all four serialization paths (K write, K read, V write, V read). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: Boundary V (experimental) тАФ layer-aware V compression Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var: - Mode 5: first2+last2 V=turbo4, rest V=turbo2 - Mode 6: last8 V=turbo4, rest V=turbo2 - Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2 Mode 7 ("Boundary V") protects quality-sensitive boundary layers with q8_0-V while aggressively compressing middle layers with turbo2-V. K cache is unchanged (stays at whatever -ctk specifies). Validated on Metal (M5 Max) across 4 models, 2 context lengths: - phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742) - Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707) - Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137) - Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273) - 8K context: stable, no collapse - NIAH retrieval: pass - Speed: no penalty Effective V compression between turbo2 and turbo3, closer to turbo2 on deeper models. Quality consistently better than uniform turbo2-V. Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai fix: turbo4 on GLM-4.7 тАФ context init check accounts for zero-padding (issue #28) The block-size divisibility check in llama-context.cpp rejected turbo4 on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128тЙа0) before the KV cache zero-padding code could run. Fix: for turbo types, compute the padded head_dim (ceil to 128) before the divisibility check, matching what llama-kv-cache.cpp actually does. Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: CUDA port of turbo4 (4-bit, 3.8x compression) тАФ fixes issue #25 bug 1 Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)". Changes TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection). Full CUDA stack: - turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize - set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm) - dequantize.cuh + convert.cu: turbo4 to f16/f32 - fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4 - fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4├Чturbo4, turbo4├Чq8_0, turbo4├Чturbo3, turbo4├Чturbo2) - ggml-cpu.c: CPU FA vec_dot for turbo4 PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8├Ч compression Speed: 217 t/s decode (comparable to turbo3 222 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: add turbo3/turbo2 cross-type VEC FA instances (issue #25 bug 2) Mixed turbo3-K/turbo2-V and turbo2-K/turbo3-V had no CUDA FA kernel instances, causing ~11x prefill regression (falling back to CPU FA). Added VEC template instances for both cross-type pairs at D=64/128/256. Updated the mixed-type guard in get_best_fattn_kernel to allow any combination of turbo2, turbo3, and q8_0. Tested: turbo3/turbo2 and turbo2/turbo3 both run at full CUDA VEC speed (~170 t/s prefill, ~221 t/s decode on Qwen3.5 35B). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> perf: CUDA MMA flash attention for D=640 (GLM-4.7 turbo3: 37тЖТ192 t/s) GLM-4.7 Flash with turbo3 KV cache pads head_dim 576тЖТ640 for WHT. D=640 had no CUDA FA kernel, falling back to CPU FA at 37 t/s. Added MMA flash attention configs for D=640 (DKQ=640, DV=512), using identical tile sizes as D=576 (nbatch_K2=288, nbatch_V2=256). ncols1 capped at 2 for D=640: at ncols1тЙе4 the Q shared memory (ncols ├Ч (DKQ/2+4) ├Ч 4 = 83KB at ncols=64) exceeds the per-block limit. ncols1тЙд2 keeps total shared memory under 80KB. Changes: - fattn-mma-f16.cuh: D=640 config entries + extern declarations (ncols2=16 only, ncols1тИИ{1,2,4}) - fattn.cu: case 640 in kernel selection + simplified MMA dispatch (always ncols2=16, ncols1 capped at 2) - fattn-tile.cuh/cu: D=640 tile configs + dispatch - 3 MMA template instances + 1 tile instance Tested: GLM-4.7 Flash: 192 t/s (was 37), PPL 17.05 (vs 14.97 f16) Qwen3.5: 222 t/s, PPL 6.31 (unchanged) DeepSeek: 143 t/s, PPL 11.61 (unchanged) 16/16 coherence, NIAH 9-10/11 at 32K Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: zero-pad non-128 heads for full 7-stage WHT (replaces q8_0 fallback) Instead of falling back to q8_0 for models with non-128-aligned heads, pad each head to the next multiple of 128 before WHT rotation. The padded elements are zero and don't affect dot products since WHT preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>. This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and other non-128 head_dim models. Padding overhead is wasted bits on the zero-padded quantized elements: head_dim=192 тЖТ 256 (33% overhead, still 3.5x compression) head_dim=576 тЖТ 640 (11% overhead, still 4.1x compression) Changes: - llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad k_cur/v_cur per-head via ggml_pad before set_rows, return padded views from get_k/get_v - llama-graph.cpp: pad Q per-head before turbo_wht, extract original V head_dim from attention output after inverse WHT. All three build_attn variants (KV, K-only/MLA, ISWA) updated. PPL (wikitext-2, 512 context, 8 chunks): DeepSeek (192тЖТ256): 11.61 vs 9.90 baseline (+17%) Previously: 344,304 with 64-group WHT (catastrophic) GLM-4.7 (576тЖТ640): 9.11 vs 14.97 baseline Qwen3.5 (128): 6.31 unchanged Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: asymmetric K/V support + q8_0 ├Ч turbo FA kernel instantiations Add full asymmetric K/V quantization support for Metal flash attention: - Pipeline naming uses k{type}_v{type} format for all FA kernels (335 total), eliminating underscore ambiguity in type names - 90 turbo ├Ч turbo asymmetric instantiations (turbo2/3/4 all combinations) - 150 q8_0 ├Ч turbo asymmetric instantiations (both directions, all head dims) - Gatekeeper and assertion updated to allow turbo ├Ч turbo and q8_0 ├Ч turbo pairs - Zero regression on existing symmetric paths (validated across 4 models, 2 machines) The q8_0 ├Ч turbo kernels fix a silent dispatch failure where mixed q8_0-K + turbo-V configs would NaN (turbo4-V) or fall to undefined paths (turbo3-V). This enables the asymmetric quality rescue: q8_0-K + turbo-V recovers near-baseline PPL on low-bit models where symmetric turbo-K degrades. Validated on Metal (M2 Pro + M5 Max): - phi-4-Q8_0: symmetric turbo3 +4.2%, turbo4 +1.7% (no regression) - Qwen2.5-7B Q4_K_M: q8_0-K + turbo4-V +1.0%, q8_0-K + turbo3-V +2.0% (rescued) - Qwen3.5-35B MoE, 27B Dense, Mistral-24B: all healthy (no regression) - Cross-hardware M2/M5 parity confirmed on all tested configs Closes #27 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai feat: asymmetric K/V quant support for Metal flash attention - Add mixed-type kernel instantiations for all turbo pairs (turbo2/3/4) - Both vec and non-vec FA paths, all supported head dims - Pipeline naming uses k/v prefix format (kturbo3_vturbo4) to avoid underscore ambiguity in type names - Relax symmetric K/V assertion to allow turbo mixed pairs - Update pipeline name construction in both FA pipeline getters - 90 new kernel instantiations (54 non-vec + 36 vec) - All 335 FA kernels use unified k{type}_v{type} naming scheme Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai feat: Metal support for turbo2 (2-bit KV cache, 6.4x compression) 4 centroids, no signs byte. Simpler than turbo3. PPL 6.507 (+6.48% vs q8_0) on Qwen3.5-35B-A3B. Metal shader: quantize, dequant (vec + non-vec), SET_ROWS kernel, FA template instantiations (non-vec nl=2, vec nl=8). Runtime: supports_op, vec/non-vec gating, llama-bench parser. Codex-reviewed: no bugs found, only known limitation is non-128 head_dim tail handling (same as turbo3, covered by q8_0 fallback). Closes #19 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai fix: require head_dim % 128 for turbo KV тАФ fall back to q8_0 otherwise PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads) causes catastrophic quality loss on some models: - DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline - GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%) The 64-group WHT passed NIAH/coherence but the weaker decorrelation (6 stages, 3 groups per 192-dim head) doesn't preserve statistical quality. Straight PolarQuant (no WHT) is even worse. All turbo types now require head_dim divisible by 128. Models with non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to q8_0 with a warning. The 64-group WHT code remains for future investigation but is not used. PPL summary (Qwen3.5, head_dim=128, wikitext-2): f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> perf: sparse V dequant тАФ skip negligible attention weights in VEC kernel In the VEC flash-attention kernel, skip V dequantization for KV positions where ALL attention weights are below 1e-6. At long context most positions contribute noise, not signal тАФ skipping their V dequant saves compute with zero quality impact. Both half2 and f32 paths are covered. The threshold (1e-6) matches the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below this are flushed to zero. Benefit scales with context length тАФ at short context nearly every position contributes so no skip occurs. At 512K+ most positions are skipped during generation. Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence. Generation at 512K: 25 t/s, prefill: 2535 t/s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: InnerQ per-channel equalization + turbo2 64-group fallback InnerQ equalizes K channel variances before WHT rotation to reduce quantization error on models with anisotropic K distributions. Enabled via TURBO_INNERQ=N env var (N = calibration token count). Pipeline: calibrate per-channel K┬▓ stats тЖТ compute scales тЖТ apply scale to K before WHT (SET_ROWS) тЖТ apply 1/scale to Q and V output via ggml_turbo_wht op (scale passed as src[1]). Math: <Q/s, s*K> = <Q, K> preserves dot products. Key design: - d_innerq_active starts at 0 (CUDA zero-init) тАФ kernel never multiplies by uninitialized scales - Auto-disables for 64-group models (group_size < 128) where channel mapping is not 1:1 across WHT groups - Auto-disables when max channel ratio < 1.2 (already balanced) - Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv tensor updates between SET_ROWS and graph-side WHT - Scale_inv tensor stored in KV cache, initialized to identity Also: turbo2 (2-bit) now requires head_dim divisible by 128. At 64-group WHT, the 4-centroid codebook doesn't have enough resolution тАФ DeepSeek (head_dim=192) produced garbage. turbo2 auto-falls back to q8_0 for non-128-aligned heads with a warning. Tested: 18/18 coherence (4 models ├Ч 4+ KV combos), InnerQ active on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: MLA inverse WHT group_size derived from K (not V) тАФ fixes GLM-4.7 The mixed KV types commit changed the inverse WHT guard from checking k->type to v->type. For MLA, V is a view of K's first 512 elements (v->ne[0]=512, 512%128=0 тЖТ group=128). But K was quantized with 64-group WHT (k->ne[0]=576, 576%128тЙа0 тЖТ group=64). The mismatch produced garbage output on GLM-4.7 Flash. Fix: derive group_size from K when K is turbo (since K determines the rotation used during quantization), from V only when K is not turbo. Tested: 16/16 coherence (4 models ├Ч 4 KV combos), including GLM-4.7. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: GGML_TYPE_TURBO2_0 тАФ 2-bit TurboQuant KV cache (6.4x compression) Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value, 6.4x compression vs f16. Same architecture as turbo3 but with a 4-centroid Lloyd-Max codebook instead of 8. Block format: block_turbo2_0 = 10 bytes per 32 values - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm) - qs (8B): 2-bit centroid indices, 4 per byte Usage: -ctk turbo2 -ctv turbo2 Full stack: - ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct - ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize - turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant - set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT) - dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion - fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2 - fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances - Mixed types: turbo2/q8_0 cross-type FA instances - common/arg.cpp: CLI --cache-type-k turbo2 - llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration NIAH (Qwen3.5 35B, RTX 5090, 4K-512K): 4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11 Coherence: 31/31 across 4 models x all KV combos (GPU + CPU) Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: implement CPU turbo3 quantize (was a stub that zeroed qs/signs) quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and zeroed out qs/signs тАФ any CPU-only path (-ngl 0) silently produced garbage output. Now implements full PolarQuant: L2 norm тЖТ normalize тЖТ forward WHT rotation (signs1 тЖТ butterfly тЖТ signs2) тЖТ 3-bit centroid quantize тЖТ pack qs/signs тЖТ corrected norm (grp_norm / recon_norm). WHT group size (128 or 64) is communicated from the CPU SET_ROWS handler via a global variable, read from the op_params that llama-kv-cache.cpp sets based on head_dim. This handles models with different K and V head dims (e.g. DeepSeek2 K=192тЖТ64, V=128тЖТ128). Tested CPU-only (-ngl 0): - Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s - DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s - GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: mixed turbo3/q8_0 KV cache types (-ctk turbo3 -ctv q8_0 and vice versa) Users can now mix turbo3 and q8_0 independently for K and V caches: -ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality -ctk q8_0 -ctv turbo3 # opposite tradeoff Changes: - New VEC FA template instances for turbo3/q8_0 cross-type pairs (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu) - fattn-vec.cuh: extern declarations for mixed instances - fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard - llama-graph.cpp: inverse WHT guard now checks V type (not K type) тАФ when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse WHT must not fire. For MLA, V is a view of K so v->type correctly reflects K's turbo type. Tested: - 16/16 coherence tests pass (4 models ├Ч 4 KV combos) - NIAH at 4K+32K: all combos within baseline variance (10-11/11) q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11, turbo3/q8_0: 9/11 (1 extra miss, normal variance) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> feat: 64-element WHT groups + MLA Q rotation fix (issue #13) Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT groups instead of crashing or falling back to q8_0. Key changes: **64-element WHT groups:** - turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64 - set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size from SET_ROWS op_params (set by llama-kv-cache based on head_dim) - turbo-wht.cu: templated on group_size, head-dim-aware dispatch - ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size parameter (0=auto) to handle MLA where output dim differs from K head dim - ops.cpp: CPU WHT parameterized on group_size **MLA fix тАФ missing Q rotation in K-only build_attn:** - build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K> = garbage. Now all three build_attn variants apply Q rotation. - Inverse WHT moved inside build_attn_mha (before v_mla projection) and also added to the non-FA attention path. **Guards + fallback:** - supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64 - llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0 - CPU type_traits_cpu turbo3 entry for CPU FA vec_dot Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s, Qwen3.5 (128) 223 t/s тАФ all producing correct output with turbo3. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: graceful fallback for turbo3 on non-128-aligned head dims (issue #13) Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash, DeepSeek2 MLA with head_dim=192/576) previously crashed with GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention. Changes: - llama-kv-cache.cpp: detect incompatible head_dim at KV cache init, log a warning, and auto-fall back to q8_0 instead of crashing - ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0 - fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D - ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float) so CPU flash attention can handle turbo3 K/V for models where CUDA FA returns NONE (e.g. D=192 which has no CUDA FA kernel for any type) - set-rows.cu: add tail kernel for non-128 remainder (future use) - turbo-wht.cu: head-dim-aware processing with tail pass-through - ops.cpp: CPU WHT head-dim-aware with tail identity copy Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3 and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128) continues to use turbo3 at full speed (223 t/s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> fix: graceful fallback for turbo3 with non-128-aligned head dims (issue #13) Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) тАФ e.g. GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 тАФ previously hard-crashed with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu. Fix: - supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0, so llama.cpp falls back to an unquantised KV path instead of asserting - get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when K head dim is not 128-aligned (VEC kernel only instantiated for DтИИ{64,128,256}) Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned heads silently use f16 KV while 128-aligned heads continue to use turbo3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> fix: VEC flash-attn Q/K stride mismatch in vec_dot_fattn_vec_KQ_turbo3_0 The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by `nthreads` (8), but the Q register file is loaded in blocks of `nthreads*cpy_ne` (32) elements per thread тАФ the same pattern used by the f16/bf16 VEC kernels. This caused thread t>0 to pair K element (16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete index mismatch. Every generated token had garbage attention scores. Fix: match the f16 kernel pattern тАФ step by nthreads*cpy_ne, add an inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as Q_v[k_KQ_0/nthreads + k_KQ_1]. Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp that documented the symptom of this bug. Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M: - PPL (wikitext-2): 6.2023 тЖТ 6.2996 (+1.57%, within 5% target) - NIAH: 11/11 at 4KтАУ256K (matches q8_0; was 0/11 before fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> perf: parallel k_set_rows_turbo3 + optimise KQ/V dequant тАФ +31% decode (71тЖТ94 t/s, 98.5% of F16) k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel gave 3.1% GPU utilisation (36.5 ┬╡s ├Ч 80 calls/token = ~21% of decode budget). Replace with a fully parallel kernel тАФ 1 block per 128-element group, 128 threads per block (one thread per element): тАв Shared-memory WHT butterfly (7 stages, no atomics) тАв Warp-reduce L2 norm + inter-warp accumulate via smem тАв qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync тАв Reconstruction norm same pattern; one write per sub-block (warp lane 0) Also tighten flash-attention dequant paths (fattn-common.cuh): тАв vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same turbo3 block тАФ load qs/signs once instead of twice per pair тАв dequantize_V_turbo3_0: ne==4 fast path тАФ load one qs byte and one signs byte for all 4 elements; unrolled float2 / half2 output pairs Benchmark (Qwen3.5 35B, RTX 5090, tg128): Before: 71.86 t/s (0.75├Ч q8_0) After: 94.04 t/s (0.985├Ч q8_0, within measurement noise of parity) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> perf: enable MMA/TILE flash attention for turbo3 тАФ 0.97x q8_0 prefill Remove the 5-line early-return that forced turbo3 onto the VEC flash attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1) via the existing dispatch logic, but prefill now goes through the Turing MMA kernel (RTX 5090 is SM 12.0 >> 7.5). launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 тАФ added in the original CUDA port commit тАФ provide that conversion automatically. Stride recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). тЬУ Before (VEC only): After (MMA for prefill): 2K prefill: 5032 t/s (0.73├Ч q8_0) 6734 t/s (0.98├Ч q8_0) 8K prefill: 3110 t/s (0.46├Ч q8_0) 6613 t/s (0.98├Ч q8_0) 32K prefill: 1215 t/s (0.19├Ч q8_0) 6168 t/s (0.97├Ч q8_0) Matches Metal M5 Max result (0.99├Ч q8_0 flat across all context sizes). Decode unchanged (VEC, ~0.64-0.82…

github-actions Bot added Nvidia GPU ggml script labels Apr 6, 2026

signalnine marked this pull request as ready for review April 6, 2026 17:23

fix: replace __dp4a with ggml_cuda_dp4a for HIP/ROCm compatibility

097d765

TheTom merged commit a4e8af4 into TheTom:feature/turboquant-kv-cache Apr 7, 2026
12 of 47 checks passed

signalnine deleted the pr/tq4-weight-opts branch April 8, 2026 01:50

Titaniumtown mentioned this pull request Apr 11, 2026

vulkan: TQ4_1s support for model weights #69

Merged

Uh oh!

Conversation

signalnine commented Apr 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmarks (RTX 5090, Qwen2.5-7B TQ4_1S)

Speed

Quality (vs f16 baseline PPL 7.18)

Key optimizations (86 automated experiments, 15 kept)

Also includes

Test plan

Uh oh!

signalnine commented Apr 6, 2026

TQ3_1S Regression Check — PASS

Uh oh!

TheTom commented Apr 6, 2026

Regression Test Results

Mac (M5 Max 128GB, Metal)

Windows (GTX 1080 Ti 11GB, CUDA 12.6, SM 6.1)

Uh oh!

signalnine commented Apr 6, 2026

Prefill regression confirmed — root cause identified

Uh oh!

TheTom commented Apr 6, 2026

Uh oh!

TheTom commented Apr 6, 2026

Prefill regression confirmed — root cause identified

Uh oh!

TheTom commented Apr 6, 2026

Uh oh!

signalnine commented Apr 6, 2026

Uh oh!

TheTom commented Apr 6, 2026

Community testing: dual RTX 4090 results

Results

Multi-GPU bug

Fix

Uh oh!

signalnine commented Apr 6, 2026

Uh oh!

TheTom commented Apr 6, 2026

AMD HIP/ROCm testing: RX 9070 XT (RDNA 4, gfx1201)

Build fix required

Regression results (Qwen2.5-1.5B, RX 9070 XT)

Regression vs previous kernel (PR #45 V12)

Uh oh!

signalnine commented Apr 6, 2026

Uh oh!

TheTom commented Apr 7, 2026

RDNA4 arch dispatch fix — cherry-pick ready

What it does

Expected RDNA4 result

Verified

Cherry-pick

Uh oh!

signalnine commented Apr 7, 2026

Uh oh!

TheTom commented Apr 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TheTom commented Apr 7, 2026

RDNA4 retest — RX 9070 XT (gfx1201, HIP SDK 7.1) — ALL PASS

Speed (Qwen2.5-1.5B-Instruct, r=10)

Before vs after fix

PPL (wikitext-2-raw, 512 context, 20 chunks)

Output quality

Note for AMD testers

Uh oh!

Uh oh!

seanrasch commented Apr 8, 2026

Ampere (RTX 3080 Ti, SM 86) — native kernel parity with q8_0 convert

Setup

Results

Interpretation

Uh oh!

TheTom commented Apr 8, 2026

Uh oh!

Reviewers

Assignees

Labels

signalnine commented Apr 6, 2026 •

edited

Loading

TheTom commented Apr 7, 2026 •

edited

Loading