feat: TQ4_1S weight compression (Metal only, needs CUDA port) #45

Draft
TheTom wants to merge 141 commits into feature/turboquant-kv-cache from pr/tq4-weight-compression

Conversation

@TheTom
Owner

@TheTom TheTom commented Apr 2, 2026

Summary

  • TQ3_1S (3-bit, 4.0 BPW) and TQ4_1S (4-bit, 5.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids
  • V2.1 fused Metal kernel: zero threadgroup memory, cooperative SIMD rotation via simd_shuffle_xor, NR0=8
  • Post-training quantization — no retraining, calibration data, or model modification required
  • Quantize via `llama-quantize --allow-requantize --tensor-type-file config.txt`
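For reference, Lloyd-Max centroids for a standard normal target (the CUDA port's commit notes describe them as "N(0,1) Lloyd-Max, 16/8 levels") can be derived offline. This is a minimal Python sketch of that design procedure, not the code shipped in this PR:

```python
import math

def lloyd_max_centroids(n_levels=16, iters=200):
    """Design an n-level Lloyd-Max quantizer for the standard normal N(0,1).

    Alternates the two optimality conditions:
      - decision boundaries are midpoints of adjacent centroids
      - each centroid is the conditional mean of N(0,1) on its cell
    """
    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)  # pdf
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))         # cdf

    # start from a uniform spread over [-3, 3]
    c = [-3.0 + 6.0 * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        # boundaries: midpoints of adjacent centroids, with +/- inf at the ends
        b = ([-math.inf]
             + [(c[i] + c[i + 1]) / 2 for i in range(n_levels - 1)]
             + [math.inf])
        for i in range(n_levels):
            a, d = b[i], b[i + 1]
            # centroid = E[x | a < x <= d] = (phi(a) - phi(d)) / (Phi(d) - Phi(a))
            pa = 0.0 if a == -math.inf else phi(a)
            pd = 0.0 if d == math.inf else phi(d)
            mass = Phi(d) - (0.0 if a == -math.inf else Phi(a))
            c[i] = (pa - pd) / mass
    return c
```

The same routine with `n_levels=8` gives the 3-bit (TQ3_1S) table.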

Tested Models

| Model | Config | Size Reduction | PPL Delta | Decode | NIAH |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | Config I | -27% | +1.9% | 96% | 6/6 |
| Qwen3.5-27B | Config I | -28% | +1.3% | 99% | 3/3 |
| Qwen3.5-35B MoE | Config I | -37% | +1.4% | 102% | |
| Qwen2.5-72B | Config I | -38% | +3.9% | 95% | 3/3 |
| Phi-4 14B | Config I | -36% | +1.0% | 254% | 3/3 |
| Llama 3.1 70B | Premium | -29% | +5.8% | fast | 3/3 |
| Llama 3.1 70B | Hybrid | -42% | +16% | 133% | 3/3 |

Llama Note

Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. Use Hybrid (TQ4 attn + Q4_K FFN) or Premium (TQ4 attn + Q5_K/Q6_K FFN) configs. Both beat Q4_K_M in quality and speed at similar size. Full investigation in the paper.
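The Hybrid and Premium splits map onto the `--tensor-type-file` override mentioned in the summary. The file syntax is specific to this fork and is not shown in this PR; assuming a simple pattern=type line format, a Hybrid override might look like the following (hypothetical, check the getting-started doc for the real syntax):

```text
# Hybrid (hypothetical syntax): TQ4 attention, Q4_K FFN
attn_q=TQ4_1S
attn_k=TQ4_1S
attn_v=TQ4_1S
attn_output=TQ4_1S
ffn_gate=Q4_K
ffn_up=Q4_K
ffn_down=Q4_K
```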

What's needed before merge

  • CUDA port of V2.1 kernel (calling @signalnine 👀)
  • HIP/ROCm testing
  • Regression tests on existing TurboQuant KV functionality
  • Community validation on untested model families

Metal only

The quantization step (llama-quantize) works on any platform. The runtime dequant kernels are Metal-specific. Compressed GGUFs will not run correctly on CUDA/HIP until those backends are ported.

Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md
Getting started: https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

🤖 Generated with Claude Code

signalnine added a commit to signalnine/llama-cpp-turboquant that referenced this pull request Apr 2, 2026
Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv)
WHT-rotated weight compression types. These achieve 27-37% model size
reduction at +1.0-1.9% PPL on Qwen/Phi families.

Base types + Metal + CPU quantize/dequant from TheTom's PR TheTom#45.
CUDA additions:

- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels),
  sign array for 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block
  inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS, excluded from
  mmvq/mmq (uses cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization:
pre-rotate activations via warp shuffle WHT (same pattern as KV cache
Q rotation) to eliminate per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner Author

TheTom commented Apr 2, 2026

Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models.

Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)
Branch: pr/tq4-weight-compression (commit 6c3e503)

Speed — No Regressions

| Model | Hardware | Config | pp512 | tg128 |
|---|---|---|---|---|
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/q8_0 | 10,787 | 198 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo4 | 10,460 | 141 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo3 | 10,468 | 138 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/q8_0 | 1,052 | 33.7 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/turbo4 | 1,051 | 30.9 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/q8_0 | 408 | 17.6 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo4 | 497 | 17.1 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo3 | 487 | 17.0 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/q8_0 | 2,920 | 76.6 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/turbo4 | 2,878 | 69.6 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/q8_0 | 352 | 34.9 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo4 | 351 | 30.8 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo3 | 350 | 30.0 |
| Qwen2.5-7B Q4_K_M | M2 Pro | turbo3/turbo3 | 346 | 26.4 |

All speeds normal or improved. No regressions.

PPL — No Regressions (full wikitext-2 runs)

| Model | q8_0/q8_0 | q8_0/turbo4 | q8_0/turbo3 |
|---|---|---|---|
| Qwen2.5-1.5B | 10.31 | 10.45 (+1.4%) | 10.55 (+2.4%) |
| Phi-4 14B | 6.54 | 6.55 (+0.2%) | |
| Qwen3.5-27B | 6.87 | 6.89 (+0.3%) | |
| Qwen3.5-35B MoE | 6.53 | 6.56 (+0.5%) | |

All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict

ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs. The ggml-metal-ops.cpp restructuring (MUL_MAT + MUL_MAT_ID dispatch) did not break any existing functionality. Safe for review.

TheTom force-pushed the pr/tq4-weight-compression branch from 6c3e503 to cb8bddc on April 2, 2026, 04:39
TheTom added a commit that referenced this pull request Apr 2, 2026
turbo4 SET_ROWS was using turbo3's shared template with wrong 2+1 bit
packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed
indices + QJL signs. PPL: 679 → 6.19.

Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756
(identical to turbo3). QJL correction hurts quality in attention context.
Consistent with scos-lab issue #45.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner Author

TheTom commented Apr 2, 2026

Update: Rebased on upstream master + regression test

Branch force-pushed. Now rebased on latest ggml-org/llama.cpp master (106 upstream commits). 3 commits on top:

  1. 187a2c1 feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
  2. 9aaa054 fix: add post-unrotate memory barrier for in-layer mixing safety
  3. cb8bddc fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream conflict: activation rotation (commit 744c0c7)

Upstream added graph-level Hadamard rotation for KV cache quantization (llama : rotate activations for better quantization). This feature:

  • Crashes on Phi-4 (graph hash table overflow from extra rotation nodes)
  • Is redundant with our kernel-level WHT rotation (which is more efficient — no extra graph nodes)

Fix: disabled upstream rotation by default in our fork. Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0. Our TurboQuant KV rotation is unaffected.

Regression test (M5 Max, rebased branch cb8bddc)

| Test | Model | Result | Status |
|---|---|---|---|
| Config I quantize + speed | Qwen 1.5B | 202 tg/s | Matches expected |
| Config I PPL | Qwen 1.5B | 10.77 (8ch) | Within noise |
| Turbo4 KV speed | Qwen 1.5B | 148 tg/s | Matches expected |
| Turbo4 KV PPL | Qwen 1.5B | 10.74 (8ch) | Within noise |
| Phi-4 + turbo4 | Phi-4 | 31.5 tg/s | No crash ✅ Fixed |
| Large model turbo4 | Qwen 27B | 17.4 tg/s | Matches expected |

All tests pass. No regressions. Phi-4 crash resolved.

@signalnine

CUDA port available on our branch: signalnine/llama-cpp-turboquant feature/tq4-weight-cuda

What's implemented:

  • CUDA dequant for TQ4_1S/TQ3_1S (convert path + cuBLAS MUL_MAT)
  • Fused mul_mat_vec kernel with pre-rotated activations (warp shuffle WHT)
  • mmvq exclusion for fused dispatch path
  • llama-quantize registration for TQ4_1S/TQ3_1S types

Results (Qwen2.5-7B TQ4_1S, RTX 5090):

| Metric | cuBLAS path | Fused kernel |
|---|---|---|
| Decode tg128 | 20 t/s | 69 t/s |
| vs q8_0 (177 t/s) | 11% | 39% |
| PPL | 8.82 | 8.82 |

The fused kernel pre-rotates the activation vector once per mul_mat via __shfl_xor_sync (5 butterfly stages, 32-element blocks), then the mmvq kernel just does centroid[idx] × scale with no per-block WHT. 13 kernel variants tested — the gap to q4_0 (275 t/s) is from dp4a integer intrinsics that TQ4_1S can't use (centroid lookup requires float). The gap to your Metal (85-99% of q8_0) is from Apple Silicon's cooperative SIMD efficiency.
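The pre-rotation design can be checked against a scalar model of the shuffle-based rotation. A minimal Python sketch (not the CUDA kernel) of the 5-stage XOR-butterfly WHT on a 32-element block; because the normalized transform is orthonormal and its own inverse, rotating both activations and weights preserves the dot product, which is what lets the mmvq kernel do plain centroid-times-scale accumulation afterwards:

```python
def wht32(x):
    """Scalar model of the kernel's 32-point Walsh-Hadamard rotation.

    The GPU version does the same 5 butterfly stages with
    __shfl_xor_sync / simd_shuffle_xor: at stage s, lane i exchanges
    its value with lane i ^ (1 << s). Normalized by 1/sqrt(32) so the
    transform is orthonormal: applying it twice returns the input,
    and dot products are preserved under rotation.
    """
    assert len(x) == 32
    y = list(x)
    for stage in range(5):  # 2**5 = 32 elements
        mask = 1 << stage
        y = [y[i] + y[i ^ mask] if not i & mask else y[i ^ mask] - y[i]
             for i in range(32)]
    return [v / 32 ** 0.5 for v in y]
```

Dot-product preservation is the correctness argument for the two-phase split: rotate the activation once per mul_mat, then every block's accumulation runs on pre-rotated values with zero per-block WHT work.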

Happy to iterate on this if you have ideas for closing the CUDA gap further.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness.

One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? i.e. the new code paths only activate for TQ4_1S/TQ3_1S and existing quant types run at the same speed as before the PR.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70-85% of q8_0:

High priority (biggest decode wins)

  1. NR0 multi-row CTA with shared activation reuse — Currently each warp handles one row independently, re-reading vy_rot from global memory. On Metal, NR0=8 (8 rows sharing one activation tile) was the single biggest optimization (decode went from 70% to 99% of baseline). On CUDA the benefit is shared reuse of vy_rot across rows per CTA. Realistic target: 69 → 110-140 t/s.

  2. Hot loop load dedup: d0/d1 are loaded per-lane but have only 2 unique values per block. Broadcast via __shfl_sync once per half-warp. Similarly, qs[lane/2] is loaded by both even and odd lanes — load once, broadcast to partner.

  3. __restrict__ qualifiers + vectorized loads — The kernel pointers lack __restrict__ (compiler can't prove no aliasing). Adding it enables better instruction scheduling. Also consider float4 loads for the pre-rotated activation to hit 128-bit memory transactions.
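To make the load-dedup point concrete, here is a scalar reference of per-block TQ4_1S dequantization under an assumed layout (32-element blocks, nibble-packed 4-bit indices, half-block scales d0/d1, per the PR description; the centroid table here is a stand-in, not the real Lloyd-Max values). Each half-block shares one scale, which is exactly what a half-warp broadcast exploits:

```python
# Stand-in centroid table (idx - 7.5); the real kernel uses Lloyd-Max values.
CENTROIDS = [i - 7.5 for i in range(16)]

def dequant_block(qs, d0, d1):
    """Dequantize one 32-element block (assumed layout, for illustration).

    qs: 16 bytes, two 4-bit indices per byte (low nibble = even element).
    d0 scales elements 0..15 and d1 scales 16..31, so a half-warp of 16
    lanes shares a single scale value: one lane can load it and broadcast
    instead of all 16 lanes issuing the same global load.
    """
    out = []
    for elem in range(32):
        byte = qs[elem // 2]
        idx = byte & 0x0F if elem % 2 == 0 else byte >> 4
        scale = d0 if elem < 16 else d1
        out.append(CENTROIDS[idx] * scale)
    return out
```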

Medium priority

  1. Small-batch kernels (ne[1]=2..8) — Real serving hits batch > 1 frequently. A tuned small-batch path before falling to cuBLAS would help.

  2. CTA shape sweep per architecture — NWARPS=8 was tested on 5090. Ampere/Ada may prefer different occupancy points. Worth parameterizing.

  3. Fused prefill path (MMQ-style) — Currently falls to dequant→cuBLAS for ne[1] > 1. A tiled matmul with shared-memory activation rotation would be faster for medium batch.

Skip / low value

  • dp4a / fp16 LUT — The pipeline is float activation × float centroid × float scale. Forcing int8 paths is awkward and unlikely to win vs a well-tuned float path.
  • Scratch buffer cleanup — Lifecycle issue (static cudaMalloc leak), not a throughput lever. Fix eventually but not urgent.

Realistic ceiling

Per architecture with full tuning (NR0 + load dedup + vectorized + batch):

  • Ampere: ~55-70% of q8_0
  • Ada: ~60-75%
  • Blackwell: ~70-85%

The 39% → 70-85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

Full Regression Test — PR #45 (cb8bddc)

Hardware

  • M5 Max 128GB (Apple Silicon)
  • Mac Mini M2 Pro 32GB (Apple Silicon)

Quantize tool verification

| Format | Quantize | Size (BPW) | Status |
|---|---|---|---|
| TQ4_1S | Config I | 1312 MiB (6.20 BPW) | ✅ 198 tg |
| TQ3_1S | attn-only | 1730 MiB (8.17 BPW) | |
| Q4_K_M | (standard) | 1060 MiB (5.00 BPW) | |

M5 Max — Uncompressed weights + TurboQuant KV

Qwen2.5-1.5B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 10,742 | 209 | 10.31 |
| q8_0/turbo4 | 10,604 | 148 | 10.45 |
| q8_0/turbo3 | 10,537 | 141 | 10.55 |

Phi-4 14B Q8_0 (crash fix verification)

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 1,083 | 34.0 | 6.54 |
| q8_0/turbo4 | 1,088 | 31.1 | 6.55 |

No crash. Upstream attn_rot disabled by default (commit cb8bddc).

Qwen3.5-27B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 554 | 17.5 | 6.87 |
| q8_0/turbo4 | 498 | 16.9 | 6.89 |
| q8_0/turbo3 | 501 | 16.0 | |

Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 2,837 | 77.0 | 6.53 |
| q8_0/turbo4 | 2,826 | 77.8 | 6.56 |

M5 Max — TQ4_1S Weight Compression

Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| Config I | 7,610 | 198 | 10.53 |
| Config I + turbo4 KV | 7,394 | 142 | |

Mac Mini M2 Pro — Qwen2.5-7B Q4_K_M

| Config | pp512 | tg128 |
|---|---|---|
| q8_0/q8_0 | 316 ± 4.6 | 33.6 ± 0.2 |
| q8_0/turbo4 | 348 ± 0.5 | 30.3 ± 0.3 |
| q8_0/turbo3 | 346 ± 0.3 | 28.7 ± 0.4 |
| turbo3/turbo3 | 338 ± 1.2 | 25.7 ± 0.3 |

Summary

  • 5 models tested across 2 hardware platforms
  • All PPL values match known-good (within measurement noise)
  • All speed values normal — no regressions
  • Phi-4 crash fixed (upstream attn_rot disabled)
  • MUL_MAT_ID (MoE) verified
  • Weight compression quantize + inference verified (TQ4_1S, TQ3_1S, Q4_K_M)
  • TurboQuant KV configs verified (q8_0/turbo4, q8_0/turbo3, turbo3/turbo3)

All tests pass. PR is safe for review.

signalnine and others added 20 commits April 2, 2026 13:07
…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128≠0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
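The padding rule described above is just a ceiling to the next multiple of the turbo block size. A one-line sketch (illustrative, mirroring what the fix computes before the divisibility check):

```python
def padded_head_dim(head_dim, qk=128):
    """Ceil head_dim to a multiple of the turbo block size (QK_TURBO4 = 128),
    matching the zero-padding llama-kv-cache.cpp applies to each row."""
    return ((head_dim + qk - 1) // qk) * qk

# GLM-4.7 Flash: 576 % 128 != 0, but rows are stored padded to 640,
# and 640 % 128 == 0, so the check must use the padded width.
```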

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…w-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 → ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
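The bits-per-value arithmetic checks out: 6 bytes saved per 128-element rotation group (consistent with dropping three redundant 2-byte norms) is 48 bits / 128 values = 0.375 bpv. A quick sketch of the accounting (illustrative):

```python
def bpv_after_dedup(bpv_before, bytes_saved=6, group=128):
    """Subtract the per-group saving: 6 bytes per 128-element rotation
    group, per the commit's accounting, spread over the group."""
    return bpv_before - bytes_saved * 8 / group

# turbo3: 3.50 -> 3.125 bpv (16 / 3.125 = 5.12x vs f16)
# turbo2: 2.50 -> 2.125 bpv (16 / 2.125 ~ 7.53x vs f16)
```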

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The block_size=128 change (adac2c6) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
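The corrected index math generalizes over block size. A scalar sketch of the addressing described above (illustrative; the byte offsets follow the commit's divide-by-4 and divide-by-8 packing ratios):

```python
QK = 128  # QK_TURBO_N after the block_size change; the same math works for 32

def addr(j):
    """Map flat element index j to (block, qs byte offset, sign byte offset).

    The bug used a per-warp block base that was only valid when QK == 32;
    deriving everything from j % QK works for both QK = 32 and QK = 128.
    """
    elem_in_block = j % QK
    block = j // QK
    return block, elem_in_block // 4, elem_in_block // 8
```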

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT → scale search → iterative
refinement (6 iter) → pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL — needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S — correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid × scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 → 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s ← BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing — TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S — 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 — those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING — apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@signalnine

Tested V12 on RTX 5090 (Blackwell, sm_120). Build aa7c82a, Qwen2.5-7B-Instruct quantized from f16.

| Config | PP512 (t/s) | TG128 (t/s) | vs Q8_0 TG |
|---|---|---|---|
| Q8_0 baseline | 14,160 | 173.4 | |
| TQ4_1S V12 (shmem) | 6,330 | 66.8 | 38.5% |

V12 is within noise of V8 on this card — 66.8 vs 69.2 t/s. The shmem broadcast doesn't help on Blackwell, probably because the 128 MB L2 on the 5090 was already caching the global scratch buffer effectively. The global memory round-trip that V12 eliminates wasn't a bottleneck here.

Interesting that Mario's dual 4090 gets 70.6% of Q8_0 while the 5090 gets 38.5%. The 4090 has ~1 TB/s bandwidth vs the 5090's ~1.8 TB/s, so the 4090 is more compute-bound relative to bandwidth — the centroid lookup cost is a smaller fraction of total time. On the 5090 the kernel is solidly bandwidth-bound but the float activation + float FMA path can't saturate the available bandwidth the way dp4a can.

I've been doing extensive kernel optimization on our branch — tested V8 through V19 plus a TQ4_0 prototype (WHT + uniform q4_0 format with native dp4a). Full optimization log here: https://github.com/signalnine/llama-cpp-turboquant/blob/feature/tq4-weight-cuda/docs/tq4-weight-cuda-optimization-log.md

Key finding: the centroid lookup itself is NOT the bottleneck (confirmed via ablation — replacing TQ4_CENTROIDS_WEIGHT[idx] with (idx-8) gives identical 69 t/s). The real bottleneck is float32 activation bandwidth (4x vs q8_1) and float FMA vs dp4a arithmetic density (4x). TQ4_0 prototype hit 237 t/s by using dp4a, but the quality delta over q4_0 disappears without Lloyd-Max centroids.
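The "4x activation bandwidth" figure can be sanity-checked with block-size arithmetic (a sketch; assumes ggml's q8_1 block of 32 int8 values plus two f16 header fields):

```python
# Bytes read per 32-element activation block by the dot-product loop.
F32_BYTES = 32 * 4           # float32 activations: 128 bytes
Q8_1_BYTES = 32 * 1 + 2 + 2  # int8 values + f16 scale + f16 sum term

def activation_read_ratio():
    """Read amplification of the float path vs a q8_1 activation path,
    roughly the 4x factor cited above (values alone: 4 B vs 1 B)."""
    return F32_BYTES / Q8_1_BYTES
```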

TheTom force-pushed the pr/tq4-weight-compression branch from 1ca65b5 to 3ae5a42 on April 2, 2026, 18:19
@TheTom
Owner Author

TheTom commented Apr 2, 2026

Bug: Gemma 4 (head_dim=256) produces garbage on current TOT

Reproduced on Metal (M5 Max). Gemma 4 31B with asymmetric KV (q8_0/turbo) produces garbled multilingual output. This matches the same class of corruption reported in #47 on CUDA.

Not CUDA-specific. Not multi-GPU-specific. It's a head_dim=256 + asymmetric code path bug that affects both Metal and CUDA.

Investigating and fixing now. Will be a separate commit on this PR.

@joemc1470

HIP/ROCm Test Results — AMD RX 6600 (gfx1032), ROCm 6.4

Hardware: AMD RX 6600 (gfx1032, 8GB VRAM)
ROCm version: 6.4.0
Build env: rocm/dev-ubuntu-24.04:6.4 Docker container on Unraid
Test model: TinyLlama 1.1B Chat Q8_0
Branch: PR #45 weight-compression


Build

HIP build succeeds with one fix required:

ggml/include/ggml.h:184 has a compiler error on GCC 13.3 (Ubuntu 24.04):

error: invalid use of 'extern' in linkage specification
  184 | #define GGML_API __attribute__ ((visibility ("default"))) extern

Fix: remove trailing extern from GGML_API macro (it's redundant inside extern "C" {} blocks):

-#        define GGML_API __attribute__ ((visibility ("default"))) extern
+#        define GGML_API __attribute__ ((visibility ("default")))

After this patch, libggml-hip.so builds cleanly.


Quantization

TQ4_1S compression works correctly:

| | Size | BPW |
|---|---|---|
| Input (Q8_0) | 1114.91 MiB | 8.50 |
| Output (TQ4_1S) | 668.18 MiB | 5.10 |
| Reduction | -40% | |

Quantize time: 15 seconds on CPU.


Inference — GPU (HIP)

Aborts in ggml_cuda_mul_mat_q

gfx1032 isn't in the rocBLAS TensileLibrary — only gfx1030 ships. Workaround HSA_OVERRIDE_GFX_VERSION=10.3.0 gets past library loading but the HIP matmul kernel still aborts at runtime:

/workspace/ggml/src/ggml-cuda/ggml-cuda.cu:100: ROCm error
_Z19ggml_cuda_mul_mat_qR25ggml_backend_cuda_contextPK11ggml_tensorS3_S3_PS1_

Affects both Q8_0 and TQ4_1S (so not TQ4-specific — base HIP matmul is broken for this arch in the current build).


Inference — CPU (TQ4_1S)

GGML_ASSERT failure — CPU dequant not implemented

/workspace/ggml/src/ggml-cpu/ggml-cpu.c:3464: GGML_ASSERT(n <= 4096) failed
  ggml_compute_forward_mul_mat

CPU inference with Q8_0 baseline works fine (266 t/s pp512, 22.5 t/s tg128), confirming the crash is TQ4_1S-specific — the CPU dequantize/matmul path for the new quant type appears unimplemented.


Summary

| Test | Result |
|---|---|
| HIP build (gfx1032) | ✅ with extern fix in ggml.h |
| TQ4_1S quantization | ✅ -40% size reduction |
| GPU inference (HIP) | ❌ matmul kernel abort (gfx1032 + ROCm 6.4) |
| CPU inference (TQ4_1S) | ❌ GGML_ASSERT(n <= 4096) in ggml-cpu.c:3464 |
| CPU inference (Q8_0 baseline) | ✅ 266 t/s pp, 22.5 t/s tg |

Happy to test again once the CPU backend and gfx1032 matmul issues are addressed.

TheTom and others added 2 commits April 2, 2026 14:00
…port)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @joemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sjoerdmaessen

Excited to see TQ4_1S weight compression land on CUDA. Happy to benchmark on dual L40S once the CUDA kernel is ready for Ada testing. Interested in both the weight compression decode performance and the combined TQ4_1S weights + turbo KV cache stacking on this architecture. Just let me know.

@TheTom
Copy link
Copy Markdown
Owner Author

TheTom commented Apr 2, 2026

Thank you @joemc1470 for the thorough report. Great debugging.

CPU assert (n <= 4096): Fixed in commit 21110eb. The CPU vec_dot functions had stack buffers sized for head_dim (128) but your model hits intermediate_size (5632) on the CPU fallback path. Replaced with heap allocation. Please pull TOT and retest.

gfx1032 HIP matmul abort: This is a known upstream issue. The RX 6600 is not in the rocBLAS TensileLibrary (only gfx1030 ships). HSA_OVERRIDE_GFX_VERSION=10.3.0 is the documented workaround but it has known caveats with MMQ kernels. Multiple upstream issues filed (ggml-org#15244, ggml-org#15202, ggml-org#21106). Not something we can fix on our side.

GGML_API extern on GCC 13.3: This is in upstream ggml.h, not our code. The trailing extern in the GGML_API macro is redundant inside extern "C" {} blocks and GCC 13.3 is stricter about it than clang. Your one-line fix is correct. We can carry it as a local patch but the proper fix should go upstream.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

It is ready to test now. Pull TOT from the PR branch and follow the quickstart:

https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md#weight-compression-tq4_1s--experimental

For your 122B you will need a custom layer count (48 layers, but only 12 are attention). For a clean first test I would recommend Qwen3.5-27B Q8_0 (64 layers, fits comfortably on dual L40S). The quickstart has copy paste instructions.

Build from pr/tq4-weight-compression branch with -DGGML_CUDA=ON. The fused CUDA kernel and all three bench commands are in the quickstart.

Stacking with turbo KV works. Mario (on X) just confirmed on dual 4090: Config I + KV turbo4, no additional penalty from stacking.

Your Ada L40S numbers would be very valuable. First datacenter GPU weight compression data.

Also would be great to confirm no decode regression on uncompressed models (Q8_0 baseline with and without the PR branch). signalnine has found some issues in decode we need to look into.
