feat: TQ4_1S weight compression (Metal only, needs CUDA port) #45

Draft
TheTom wants to merge 141 commits into feature/turboquant-kv-cache from pr/tq4-weight-compression

Conversation

@TheTom
Owner

@TheTom TheTom commented Apr 2, 2026

Summary

  • TQ3_1S (3-bit, 4.0 BPW) and TQ4_1S (4-bit, 5.0 BPW) weight quantization using WHT rotation + Lloyd-Max centroids
  • V2.1 fused Metal kernel: zero threadgroup memory, cooperative SIMD rotation via simd_shuffle_xor, NR0=8
  • Post-training quantization — no retraining, calibration data, or model modification required
  • Quantize via `llama-quantize --allow-requantize --tensor-type-file config.txt`
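For reference, Lloyd-Max centroids for a standard normal target (the CUDA port's commit notes describe them as "N(0,1) Lloyd-Max, 16/8 levels") can be derived offline. This is a minimal Python sketch of that design procedure, not the code shipped in this PR:

```python
import math

def lloyd_max_centroids(n_levels=16, iters=200):
    """Design an n-level Lloyd-Max quantizer for the standard normal N(0,1).

    Alternates the two optimality conditions:
      - decision boundaries are midpoints of adjacent centroids
      - each centroid is the conditional mean of N(0,1) on its cell
    """
    phi = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)  # pdf
    Phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))         # cdf

    # start from a uniform spread over [-3, 3]
    c = [-3.0 + 6.0 * (i + 0.5) / n_levels for i in range(n_levels)]
    for _ in range(iters):
        # boundaries: midpoints of adjacent centroids, with +/- inf at the ends
        b = ([-math.inf]
             + [(c[i] + c[i + 1]) / 2 for i in range(n_levels - 1)]
             + [math.inf])
        for i in range(n_levels):
            a, d = b[i], b[i + 1]
            # centroid = E[x | a < x <= d] = (phi(a) - phi(d)) / (Phi(d) - Phi(a))
            pa = 0.0 if a == -math.inf else phi(a)
            pd = 0.0 if d == math.inf else phi(d)
            mass = Phi(d) - (0.0 if a == -math.inf else Phi(a))
            c[i] = (pa - pd) / mass
    return c
```

The same routine with `n_levels=8` gives the 3-bit (TQ3_1S) table.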

Tested Models

| Model | Config | Size Reduction | PPL Delta | Decode | NIAH |
|---|---|---|---|---|---|
| Qwen2.5-1.5B | Config I | -27% | +1.9% | 96% | 6/6 |
| Qwen3.5-27B | Config I | -28% | +1.3% | 99% | 3/3 |
| Qwen3.5-35B MoE | Config I | -37% | +1.4% | 102% | |
| Qwen2.5-72B | Config I | -38% | +3.9% | 95% | 3/3 |
| Phi-4 14B | Config I | -36% | +1.0% | 254% | 3/3 |
| Llama 3.1 70B | Premium | -29% | +5.8% | fast | 3/3 |
| Llama 3.1 70B | Hybrid | -42% | +16% | 133% | 3/3 |

Llama Note

Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. Use Hybrid (TQ4 attn + Q4_K FFN) or Premium (TQ4 attn + Q5_K/Q6_K FFN) configs. Both beat Q4_K_M in quality and speed at similar size. Full investigation in the paper.
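The Hybrid and Premium splits map onto the `--tensor-type-file` override mentioned in the summary. The file syntax is specific to this fork and is not shown in this PR; assuming a simple pattern=type line format, a Hybrid override might look like the following (hypothetical, check the getting-started doc for the real syntax):

```text
# Hybrid (hypothetical syntax): TQ4 attention, Q4_K FFN
attn_q=TQ4_1S
attn_k=TQ4_1S
attn_v=TQ4_1S
attn_output=TQ4_1S
ffn_gate=Q4_K
ffn_up=Q4_K
ffn_down=Q4_K
```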

What's needed before merge

  • CUDA port of V2.1 kernel (calling @signalnine 👀)
  • HIP/ROCm testing
  • Regression tests on existing TurboQuant KV functionality
  • Community validation on untested model families

Metal only

The quantization step (llama-quantize) works on any platform. The runtime dequant kernels are Metal-specific. Compressed GGUFs will not run correctly on CUDA/HIP until those backends are ported.

Paper: https://github.com/TheTom/turboquant_plus/blob/main/docs/papers/weight-compression-tq4.md
Getting started: https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md

🤖 Generated with Claude Code

signalnine added a commit to signalnine/llama-cpp-turboquant that referenced this pull request Apr 2, 2026
Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv)
WHT-rotated weight compression types. These achieve 27-37% model size
reduction at +1.0-1.9% PPL on Qwen/Phi families.

Base types + Metal + CPU quantize/dequant from TheTom's PR TheTom#45.
CUDA additions:

- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels),
  sign array for 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block
  inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS, excluded from
  mmvq/mmq (uses cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization:
pre-rotate activations via warp shuffle WHT (same pattern as KV cache
Q rotation) to eliminate per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner Author

TheTom commented Apr 2, 2026

Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models.

Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)
Branch: pr/tq4-weight-compression (commit 6c3e503)

Speed — No Regressions

| Model | Hardware | Config | pp512 | tg128 |
|---|---|---|---|---|
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/q8_0 | 10,787 | 198 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo4 | 10,460 | 141 |
| Qwen2.5-1.5B Q8_0 | M5 Max | q8_0/turbo3 | 10,468 | 138 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/q8_0 | 1,052 | 33.7 |
| Phi-4 14B Q8_0 | M5 Max | q8_0/turbo4 | 1,051 | 30.9 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/q8_0 | 408 | 17.6 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo4 | 497 | 17.1 |
| Qwen3.5-27B Q8_0 | M5 Max | q8_0/turbo3 | 487 | 17.0 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/q8_0 | 2,920 | 76.6 |
| Qwen3.5-35B MoE Q8_0 | M5 Max | q8_0/turbo4 | 2,878 | 69.6 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/q8_0 | 352 | 34.9 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo4 | 351 | 30.8 |
| Qwen2.5-7B Q4_K_M | M2 Pro | q8_0/turbo3 | 350 | 30.0 |
| Qwen2.5-7B Q4_K_M | M2 Pro | turbo3/turbo3 | 346 | 26.4 |

All speeds normal or improved. No regressions.

PPL — No Regressions (full wikitext-2 runs)

| Model | q8_0/q8_0 | q8_0/turbo4 | q8_0/turbo3 |
|---|---|---|---|
| Qwen2.5-1.5B | 10.31 | 10.45 (+1.4%) | 10.55 (+2.4%) |
| Phi-4 14B | 6.54 | 6.55 (+0.2%) | |
| Qwen3.5-27B | 6.87 | 6.89 (+0.3%) | |
| Qwen3.5-35B MoE | 6.53 | 6.56 (+0.5%) | |

All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict

ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs. The ggml-metal-ops.cpp restructuring (MUL_MAT + MUL_MAT_ID dispatch) did not break any existing functionality. Safe for review.

TheTom force-pushed the pr/tq4-weight-compression branch from 6c3e503 to cb8bddc on April 2, 2026, 04:39
TheTom added a commit that referenced this pull request Apr 2, 2026
turbo4 SET_ROWS was using turbo3's shared template with wrong 2+1 bit
packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed
indices + QJL signs. PPL: 679 → 6.19.

Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756
(identical to turbo3). QJL correction hurts quality in attention context.
Consistent with scos-lab issue #45.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@TheTom
Owner Author

TheTom commented Apr 2, 2026

Update: Rebased on upstream master + regression test

Branch force-pushed. Now rebased on latest ggml-org/llama.cpp master (106 upstream commits). 3 commits on top:

  1. 187a2c1 feat: TQ3_1S + TQ4_1S weight quantization with V2.1 fused Metal kernels
  2. 9aaa054 fix: add post-unrotate memory barrier for in-layer mixing safety
  3. cb8bddc fix: disable upstream attn rotation by default (conflicts with TurboQuant)

Upstream conflict: activation rotation (commit 744c0c7)

Upstream added graph-level Hadamard rotation for KV cache quantization (llama : rotate activations for better quantization). This feature:

  • Crashes on Phi-4 (graph hash table overflow from extra rotation nodes)
  • Is redundant with our kernel-level WHT rotation (which is more efficient — no extra graph nodes)

Fix: disabled upstream rotation by default in our fork. Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0. Our TurboQuant KV rotation is unaffected.

Regression test (M5 Max, rebased branch cb8bddc)

| Test | Model | Result | Status |
|---|---|---|---|
| Config I quantize + speed | Qwen 1.5B | 202 tg/s | Matches expected |
| Config I PPL | Qwen 1.5B | 10.77 (8ch) | Within noise |
| Turbo4 KV speed | Qwen 1.5B | 148 tg/s | Matches expected |
| Turbo4 KV PPL | Qwen 1.5B | 10.74 (8ch) | Within noise |
| Phi-4 + turbo4 | Phi-4 | 31.5 tg/s | No crash ✅ Fixed |
| Large model turbo4 | Qwen 27B | 17.4 tg/s | Matches expected |

All tests pass. No regressions. Phi-4 crash resolved.

@signalnine

CUDA port available on our branch: signalnine/llama-cpp-turboquant feature/tq4-weight-cuda

What's implemented:

  • CUDA dequant for TQ4_1S/TQ3_1S (convert path + cuBLAS MUL_MAT)
  • Fused mul_mat_vec kernel with pre-rotated activations (warp shuffle WHT)
  • mmvq exclusion for fused dispatch path
  • llama-quantize registration for TQ4_1S/TQ3_1S types

Results (Qwen2.5-7B TQ4_1S, RTX 5090):

| Metric | cuBLAS path | Fused kernel |
|---|---|---|
| Decode tg128 | 20 t/s | 69 t/s |
| vs q8_0 (177 t/s) | 11% | 39% |
| PPL | 8.82 | 8.82 |

The fused kernel pre-rotates the activation vector once per mul_mat via __shfl_xor_sync (5 butterfly stages, 32-element blocks), then the mmvq kernel just does centroid[idx] × scale with no per-block WHT. 13 kernel variants tested — the gap to q4_0 (275 t/s) is from dp4a integer intrinsics that TQ4_1S can't use (centroid lookup requires float). The gap to your Metal (85-99% of q8_0) is from Apple Silicon's cooperative SIMD efficiency.
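The pre-rotation design can be checked against a scalar model of the shuffle-based rotation. A minimal Python sketch (not the CUDA kernel) of the 5-stage XOR-butterfly WHT on a 32-element block; because the normalized transform is orthonormal and its own inverse, rotating both activations and weights preserves the dot product, which is what lets the mmvq kernel do plain centroid-times-scale accumulation afterwards:

```python
def wht32(x):
    """Scalar model of the kernel's 32-point Walsh-Hadamard rotation.

    The GPU version does the same 5 butterfly stages with
    __shfl_xor_sync / simd_shuffle_xor: at stage s, lane i exchanges
    its value with lane i ^ (1 << s). Normalized by 1/sqrt(32) so the
    transform is orthonormal: applying it twice returns the input,
    and dot products are preserved under rotation.
    """
    assert len(x) == 32
    y = list(x)
    for stage in range(5):  # 2**5 = 32 elements
        mask = 1 << stage
        y = [y[i] + y[i ^ mask] if not i & mask else y[i ^ mask] - y[i]
             for i in range(32)]
    return [v / 32 ** 0.5 for v in y]
```

Dot-product preservation is the correctness argument for the two-phase split: rotate the activation once per mul_mat, then every block's accumulation runs on pre-rotated values with zero per-block WHT work.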

Happy to iterate on this if you have ideas for closing the CUDA gap further.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness.

One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? i.e. the new code paths only activate for TQ4_1S/TQ3_1S and existing quant types run at the same speed as before the PR.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70-85% of q8_0:

High priority (biggest decode wins)

  1. NR0 multi-row CTA with shared activation reuse — Currently each warp handles one row independently, re-reading vy_rot from global memory. On Metal, NR0=8 (8 rows sharing one activation tile) was the single biggest optimization (decode went from 70% to 99% of baseline). On CUDA the benefit is shared reuse of vy_rot across rows per CTA. Realistic target: 69 → 110-140 t/s.

  2. Hot loop load dedup: d0/d1 are loaded per-lane but have only 2 unique values per block. Broadcast via __shfl_sync once per half-warp. Similarly, qs[lane/2] is loaded by both even and odd lanes — load once, broadcast to partner.

  3. __restrict__ qualifiers + vectorized loads — The kernel pointers lack __restrict__ (compiler can't prove no aliasing). Adding it enables better instruction scheduling. Also consider float4 loads for the pre-rotated activation to hit 128-bit memory transactions.
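To make the load-dedup point concrete, here is a scalar reference of per-block TQ4_1S dequantization under an assumed layout (32-element blocks, nibble-packed 4-bit indices, half-block scales d0/d1, per the PR description; the centroid table here is a stand-in, not the real Lloyd-Max values). Each half-block shares one scale, which is exactly what a half-warp broadcast exploits:

```python
# Stand-in centroid table (idx - 7.5); the real kernel uses Lloyd-Max values.
CENTROIDS = [i - 7.5 for i in range(16)]

def dequant_block(qs, d0, d1):
    """Dequantize one 32-element block (assumed layout, for illustration).

    qs: 16 bytes, two 4-bit indices per byte (low nibble = even element).
    d0 scales elements 0..15 and d1 scales 16..31, so a half-warp of 16
    lanes shares a single scale value: one lane can load it and broadcast
    instead of all 16 lanes issuing the same global load.
    """
    out = []
    for elem in range(32):
        byte = qs[elem // 2]
        idx = byte & 0x0F if elem % 2 == 0 else byte >> 4
        scale = d0 if elem < 16 else d1
        out.append(CENTROIDS[idx] * scale)
    return out
```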

Medium priority

  1. Small-batch kernels (ne[1]=2..8) — Real serving hits batch > 1 frequently. A tuned small-batch path before falling to cuBLAS would help.

  2. CTA shape sweep per architecture — NWARPS=8 was tested on 5090. Ampere/Ada may prefer different occupancy points. Worth parameterizing.

  3. Fused prefill path (MMQ-style) — Currently falls to dequant→cuBLAS for ne[1] > 1. A tiled matmul with shared-memory activation rotation would be faster for medium batch.

Skip / low value

  • dp4a / fp16 LUT — The pipeline is float activation × float centroid × float scale. Forcing int8 paths is awkward and unlikely to win vs a well-tuned float path.
  • Scratch buffer cleanup — Lifecycle issue (static cudaMalloc leak), not a throughput lever. Fix eventually but not urgent.

Realistic ceiling

Per architecture with full tuning (NR0 + load dedup + vectorized + batch):

  • Ampere: ~55-70% of q8_0
  • Ada: ~60-75%
  • Blackwell: ~70-85%

The 39% → 70-85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

Full Regression Test — PR #45 (cb8bddc)

Hardware

  • M5 Max 128GB (Apple Silicon)
  • Mac Mini M2 Pro 32GB (Apple Silicon)

Quantize tool verification

| Format | Quantize | Size (BPW) | Status |
|---|---|---|---|
| TQ4_1S | Config I | 1312 MiB (6.20 BPW) | ✅ 198 tg |
| TQ3_1S | attn-only | 1730 MiB (8.17 BPW) | |
| Q4_K_M | (standard) | 1060 MiB (5.00 BPW) | |

M5 Max — Uncompressed weights + TurboQuant KV

Qwen2.5-1.5B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 10,742 | 209 | 10.31 |
| q8_0/turbo4 | 10,604 | 148 | 10.45 |
| q8_0/turbo3 | 10,537 | 141 | 10.55 |

Phi-4 14B Q8_0 (crash fix verification)

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 1,083 | 34.0 | 6.54 |
| q8_0/turbo4 | 1,088 | 31.1 | 6.55 |

No crash. Upstream attn_rot disabled by default (commit cb8bddc).

Qwen3.5-27B Q8_0

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 554 | 17.5 | 6.87 |
| q8_0/turbo4 | 498 | 16.9 | 6.89 |
| q8_0/turbo3 | 501 | 16.0 | |

Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| q8_0/q8_0 | 2,837 | 77.0 | 6.53 |
| q8_0/turbo4 | 2,826 | 77.8 | 6.56 |

M5 Max — TQ4_1S Weight Compression

Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)

| Config | pp512 | tg128 | PPL (full) |
|---|---|---|---|
| Config I | 7,610 | 198 | 10.53 |
| Config I + turbo4 KV | 7,394 | 142 | |

Mac Mini M2 Pro — Qwen2.5-7B Q4_K_M

| Config | pp512 | tg128 |
|---|---|---|
| q8_0/q8_0 | 316 ± 4.6 | 33.6 ± 0.2 |
| q8_0/turbo4 | 348 ± 0.5 | 30.3 ± 0.3 |
| q8_0/turbo3 | 346 ± 0.3 | 28.7 ± 0.4 |
| turbo3/turbo3 | 338 ± 1.2 | 25.7 ± 0.3 |

Summary

  • 5 models tested across 2 hardware platforms
  • All PPL values match known-good (within measurement noise)
  • All speed values normal — no regressions
  • Phi-4 crash fixed (upstream attn_rot disabled)
  • MUL_MAT_ID (MoE) verified
  • Weight compression quantize + inference verified (TQ4_1S, TQ3_1S, Q4_K_M)
  • TurboQuant KV configs verified (q8_0/turbo4, q8_0/turbo3, turbo3/turbo3)

All tests pass. PR is safe for review.

signalnine and others added 20 commits April 2, 2026 13:07
…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format
(16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on
CUDA with "cannot run the operation (SET_ROWS)".

Changes TURBO4_USE_4BIT default from Metal-only to all backends.
The 4-bit format (16 centroids) has better quality than the legacy
3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid,
  dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation,
  4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances
  (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… (issue #28)

The block-size divisibility check in llama-context.cpp rejected turbo4
on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128≠0) before the
KV cache zero-padding code could run.

Fix: for turbo types, compute the padded head_dim (ceil to 128) before
the divisibility check, matching what llama-kv-cache.cpp actually does.

Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s.
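The padding rule described above is just a ceiling to the next multiple of the turbo block size. A one-line sketch (illustrative, mirroring what the fix computes before the divisibility check):

```python
def padded_head_dim(head_dim, qk=128):
    """Ceil head_dim to a multiple of the turbo block size (QK_TURBO4 = 128),
    matching the zero-padding llama-kv-cache.cpp applies to each row."""
    return ((head_dim + qk - 1) // qk) * qk

# GLM-4.7 Flash: 576 % 128 != 0, but rows are stored padded to 640,
# and 640 % 128 == 0, so the check must use the padded width.
```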

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2

Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).

Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty

Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.

Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…w-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576)
for ggml_row_size, but turbo types zero-pad to 640. For turbo4
(QK=128), 576 % 128 != 0 → ggml_row_size assertion failure during
prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (actual padded tensor width) instead of
hparams values in all four serialization paths (K write, K read,
V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3
flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync
  to support both 3-arg and 4-arg calls (CUDA allows defaulting width
  to warpSize, HIP macros required 4 args)
- Add __ballot_sync -> __ballot with uint32_t cast (HIP returns 64-bit
  on wave64 platforms, turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files
  as CUDA CMakeLists, were missing from HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16)
Previously showed catastrophic PPL ~15000 due to CPU quantize stub bug
(fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.

turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
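The bits-per-value arithmetic checks out: 6 bytes saved per 128-element rotation group (consistent with dropping three redundant 2-byte norms) is 48 bits / 128 values = 0.375 bpv. A quick sketch of the accounting (illustrative):

```python
def bpv_after_dedup(bpv_before, bytes_saved=6, group=128):
    """Subtract the per-group saving: 6 bytes per 128-element rotation
    group, per the commit's accounting, spread over the group."""
    return bpv_before - bytes_saved * 8 / group

# turbo3: 3.50 -> 3.125 bpv (16 / 3.125 = 5.12x vs f16)
# turbo2: 2.50 -> 2.125 bpv (16 / 2.125 ~ 7.53x vs f16)
```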

Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass

+3-7% decode on tested M2 Pro setup. No regression on M5.

Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.

Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The block_size=128 change (adac2c6) broke CUDA quantization:
with QK=128, blocks_per_group=1, but the warp-cooperative packing
still used blk_base+warp_id, causing warps 1-3 to write OOB.

Fix: compute elem_in_block = j % QK_TURBO_N and use it for block
pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs,
elem_in_block / 8 for signs). Works for both QK=32 and QK=128.
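The corrected index math generalizes over block size. A scalar sketch of the addressing described above (illustrative; the byte offsets follow the commit's divide-by-4 and divide-by-8 packing ratios):

```python
QK = 128  # QK_TURBO_N after the block_size change; the same math works for 32

def addr(j):
    """Map flat element index j to (block, qs byte offset, sign byte offset).

    The bug used a per-warp block base that was only valid when QK == 32;
    deriving everything from j % QK works for both QK = 32 and QK = 128.
    """
    elem_in_block = j % QK
    block = j // QK
    return block, elem_in_block // 4, elem_in_block // 8
```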

Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3:
PPL = 7.587 (matches QK=32 baseline exactly).
Sparse V: now enabled by default on all Metal (was M5+ only).
Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set.
Protects first 2 + last 2 layers with q8_0-V, rest turbo2-V.
37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
  The HIP build was missing 9 turbo cross-type flash attention vec
  instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were
  present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists.

  Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP
  since those instance files are already excluded from the HIP build
  (they exceed HIP's 65536-byte local memory limit).

  Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use 32-element Randomized Hadamard Transform with dual half-block
scales (d0/d1). Quantization: forward RHT → scale search → iterative
refinement (6 iter) → pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL — needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without this barrier, the GPU may start executing the next node's
matmul while the unrotate kernel is still modifying src1. This
causes data corruption when TQ and non-TQ tensors are mixed within
the same layer's attention block.

The barrier ensures the unrotate completes before any subsequent
operation reads from the same activation buffer.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uant)

Upstream commit 744c0c7 added graph-level Hadamard rotation for KV
cache quantization. This conflicts with our kernel-level WHT rotation
and causes graph hash table overflow on Phi-4 and potentially other
models.

Disable by default since TurboQuant already handles rotation at the
kernel level (more efficient, no extra graph nodes). Users can
re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_
  mul_mat_vec_q (was missing, causing ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in allowed types

Tested: Qwen2.5-7B TQ4_1S — correct output, PPL 8.82 (+1.1% vs Q2_K),
6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-phase approach: pre-rotate activation once via warp shuffle WHT,
then simple mmvq kernel reads pre-rotated values (centroid × scale
only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
  Decode: 20.3 → 69 t/s (3.4x speedup)
  vs q8_0 (177 t/s): 39%
  PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
  cuBLAS baseline:              20 t/s
  V1  per-warp WHT, 4 warps:   60 t/s (3.0x)
  V3  shmem activation cache:  33 t/s (syncthreads kills it)
  V5  multi-warp per row:      62 t/s
  V6  LUT (shmem):             37 t/s (sync overhead)
  V7  8 warps clean:           62 t/s
  V8  pre-rotation (2-phase):  69 t/s ← BEST (3.4x)
  V9  pre-rot + q8_1:          70 t/s (marginal)
  V10 4-elem/thread:           57 t/s
  V13 8-elem, 4-thread dot:    45 t/s
  NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) is from dp4a integer intrinsics and packed
int32 processing — TQ4_1S requires float centroid lookup which can't
use dp4a. The gap to TheTom's Metal (85-99% of q8_0) is from
Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
MoE models use MUL_MAT_ID which calls cudaStreamSynchronize in the
expert routing dispatch. This is incompatible with CUDA graph capture.
TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching
the existing behavior for non-quantized types.

Also: use persistent buffer for activation pre-rotation scratch,
with graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S — 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ratch

Replace two-phase approach (separate pre-rotation kernel + global memory
round-trip) with single-phase kernel where all 8 warps cooperatively
WHT-rotate activation into shared memory, then each warp processes one
row reading from shmem (broadcast reads from L1).

Key changes:
  - No global scratch buffer (eliminates CUDA graph incompatibility)
  - No separate kernel launch for pre-rotation
  - Activation stays in shmem (~14-32 KB) instead of global memory
  - Single __syncthreads between rotation and dot product (NOT in inner loop)
  - V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 — those had sync
inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates
2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING — apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add _USE_MATH_DEFINES + M_PI fallback for MSVC (doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move extern declaration to file scope with extern "C" GGML_API linkage
  to fix C vs C++ name mangling across DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
  C2065: 'M_PI': undeclared identifier
  LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses
__shfl_xor_sync which maps directly to HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I.
Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.
Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@signalnine

Tested V12 on RTX 5090 (Blackwell, sm_120). Build aa7c82a, Qwen2.5-7B-Instruct quantized from f16.

| Config | PP512 (t/s) | TG128 (t/s) | vs Q8_0 TG |
|---|---|---|---|
| Q8_0 baseline | 14,160 | 173.4 | |
| TQ4_1S V12 (shmem) | 6,330 | 66.8 | 38.5% |

V12 is within noise of V8 on this card — 66.8 vs 69.2 t/s. The shmem broadcast doesn't help on Blackwell, probably because the 128 MB L2 on the 5090 was already caching the global scratch buffer effectively. The global memory round-trip that V12 eliminates wasn't a bottleneck here.

Interesting that Mario's dual 4090 gets 70.6% of Q8_0 while the 5090 gets 38.5%. The 4090 has ~1 TB/s bandwidth vs the 5090's ~1.8 TB/s, so the 4090 is more compute-bound relative to bandwidth — the centroid lookup cost is a smaller fraction of total time. On the 5090 the kernel is solidly bandwidth-bound but the float activation + float FMA path can't saturate the available bandwidth the way dp4a can.

I've been doing extensive kernel optimization on our branch — tested V8 through V19 plus a TQ4_0 prototype (WHT + uniform q4_0 format with native dp4a). Full optimization log here: https://github.com/signalnine/llama-cpp-turboquant/blob/feature/tq4-weight-cuda/docs/tq4-weight-cuda-optimization-log.md

Key finding: the centroid lookup itself is NOT the bottleneck (confirmed via ablation — replacing TQ4_CENTROIDS_WEIGHT[idx] with (idx-8) gives identical 69 t/s). The real bottleneck is float32 activation bandwidth (4x vs q8_1) and float FMA vs dp4a arithmetic density (4x). TQ4_0 prototype hit 237 t/s by using dp4a, but the quality delta over q4_0 disappears without Lloyd-Max centroids.
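The "4x activation bandwidth" figure can be sanity-checked with block-size arithmetic (a sketch; assumes ggml's q8_1 block of 32 int8 values plus two f16 header fields):

```python
# Bytes read per 32-element activation block by the dot-product loop.
F32_BYTES = 32 * 4           # float32 activations: 128 bytes
Q8_1_BYTES = 32 * 1 + 2 + 2  # int8 values + f16 scale + f16 sum term

def activation_read_ratio():
    """Read amplification of the float path vs a q8_1 activation path,
    roughly the 4x factor cited above (values alone: 4 B vs 1 B)."""
    return F32_BYTES / Q8_1_BYTES
```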

TheTom force-pushed the pr/tq4-weight-compression branch from 1ca65b5 to 3ae5a42 on April 2, 2026, 18:19
@TheTom
Owner Author

TheTom commented Apr 2, 2026

Bug: Gemma 4 (head_dim=256) produces garbage on current TOT

Reproduced on Metal (M5 Max). Gemma 4 31B with asymmetric KV (q8_0/turbo) produces garbled multilingual output. This matches the same class of corruption reported in #47 on CUDA.

Not CUDA-specific. Not multi-GPU-specific. It's a head_dim=256 + asymmetric code path bug that affects both Metal and CUDA.

Investigating and fixing now. Will be a separate commit on this PR.

@joemc1470

HIP/ROCm Test Results — AMD RX 6600 (gfx1032), ROCm 6.4

Hardware: AMD RX 6600 (gfx1032, 8GB VRAM)
ROCm version: 6.4.0
Build env: rocm/dev-ubuntu-24.04:6.4 Docker container on Unraid
Test model: TinyLlama 1.1B Chat Q8_0
Branch: PR #45 weight-compression


Build

HIP build succeeds with one fix required:

ggml/include/ggml.h:184 has a compiler error on GCC 13.3 (Ubuntu 24.04):

error: invalid use of 'extern' in linkage specification
  184 | #define GGML_API __attribute__ ((visibility ("default"))) extern

Fix: remove trailing extern from GGML_API macro (it's redundant inside extern "C" {} blocks):

-#        define GGML_API __attribute__ ((visibility ("default"))) extern
+#        define GGML_API __attribute__ ((visibility ("default")))

After this patch, libggml-hip.so builds cleanly.


Quantization

TQ4_1S compression works correctly:

| | Size | BPW |
|---|---|---|
| Input (Q8_0) | 1114.91 MiB | 8.50 |
| Output (TQ4_1S) | 668.18 MiB | 5.10 |
| Reduction | -40% | |

Quantize time: 15 seconds on CPU.


Inference — GPU (HIP)

Aborts in ggml_cuda_mul_mat_q

gfx1032 isn't in the rocBLAS TensileLibrary — only gfx1030 ships. Workaround HSA_OVERRIDE_GFX_VERSION=10.3.0 gets past library loading but the HIP matmul kernel still aborts at runtime:

/workspace/ggml/src/ggml-cuda/ggml-cuda.cu:100: ROCm error
_Z19ggml_cuda_mul_mat_qR25ggml_backend_cuda_contextPK11ggml_tensorS3_S3_PS1_

Affects both Q8_0 and TQ4_1S (so not TQ4-specific — base HIP matmul is broken for this arch in the current build).


Inference — CPU (TQ4_1S)

GGML_ASSERT failure — CPU dequant not implemented

/workspace/ggml/src/ggml-cpu/ggml-cpu.c:3464: GGML_ASSERT(n <= 4096) failed
  ggml_compute_forward_mul_mat

CPU inference with Q8_0 baseline works fine (266 t/s pp512, 22.5 t/s tg128), confirming the crash is TQ4_1S-specific — the CPU dequantize/matmul path for the new quant type appears unimplemented.


Summary

| Test | Result |
|---|---|
| HIP build (gfx1032) | ✅ with extern fix in ggml.h |
| TQ4_1S quantization | ✅ -40% size reduction |
| GPU inference (HIP) | ❌ matmul kernel abort (gfx1032 + ROCm 6.4) |
| CPU inference (TQ4_1S) | ❌ GGML_ASSERT(n <= 4096) in ggml-cpu.c:3464 |
| CPU inference (Q8_0 baseline) | ✅ 266 t/s pp, 22.5 t/s tg |

Happy to test again once the CPU backend and gfx1032 matmul issues are addressed.

TheTom and others added 2 commits April 2, 2026 14:00
…port)

Gemma 4 uses global_head_dim=512 for full attention layers. The turbo
FA kernels were only instantiated up to dk256 for symmetric and
cross-turbo combos. Missing dk512_dv512 caused pipeline compilation
failure on Gemma 4 (and any future model with head_dim=512 + turbo KV).

Added 18 template instantiations (9 non-vec + 9 vec) for all turbo
type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already
had dk512 and were not affected.

Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric
q8_0/turbo4 both produce correct bench results at dk512.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions
crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632,
Qwen 27B 18944). Replaced with heap allocation.

Affects CPU-only inference fallback path. GPU users unaffected.

Reported by @joemc1470 on RX 6600 (gfx1032) where broken HIP forced
CPU fallback.

Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944.
No crash, no assert.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sjoerdmaessen

Excited to see TQ4_1S weight compression land on CUDA. Happy to benchmark on dual L40S once the CUDA kernel is ready for Ada testing. Interested in both the weight compression decode performance and the combined TQ4_1S weights + turbo KV cache stacking on this architecture. Just let me know.

@TheTom
Copy link
Copy Markdown
Owner Author

TheTom commented Apr 2, 2026

Thank you @joemc1470 for the thorough report. Great debugging.

CPU assert (n <= 4096): Fixed in commit 21110eb. The CPU vec_dot functions had stack buffers sized for head_dim (128) but your model hits intermediate_size (5632) on the CPU fallback path. Replaced with heap allocation. Please pull TOT and retest.

gfx1032 HIP matmul abort: This is a known upstream issue. The RX 6600 is not in the rocBLAS TensileLibrary (only gfx1030 ships). HSA_OVERRIDE_GFX_VERSION=10.3.0 is the documented workaround but it has known caveats with MMQ kernels. Multiple upstream issues filed (ggml-org#15244, ggml-org#15202, ggml-org#21106). Not something we can fix on our side.

GGML_API extern on GCC 13.3: This is in upstream ggml.h, not our code. The trailing extern in the GGML_API macro is redundant inside extern "C" {} blocks and GCC 13.3 is stricter about it than clang. Your one-line fix is correct. We can carry it as a local patch but the proper fix should go upstream.

@TheTom
Owner Author

TheTom commented Apr 2, 2026

It is ready to test now. Pull TOT from the PR branch and follow the quickstart:

https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md#weight-compression-tq4_1s--experimental

For your 122B you will need a custom layer count (48 layers, but only 12 are attention). For a clean first test I would recommend Qwen3.5-27B Q8_0 (64 layers, fits comfortably on dual L40S). The quickstart has copy paste instructions.

Build from pr/tq4-weight-compression branch with -DGGML_CUDA=ON. The fused CUDA kernel and all three bench commands are in the quickstart.

Stacking with turbo KV works. Mario (on X) just confirmed on dual 4090: Config I + KV turbo4, no additional penalty from stacking.

Your Ada L40S numbers would be very valuable. First datacenter GPU weight compression data.

Also would be great to confirm no decode regression on uncompressed models (Q8_0 baseline with and without the PR branch). signalnine has found some issues in decode we need to look into.
