feat: TQ4_1S weight compression (Metal only, needs CUDA port) #45
TheTom wants to merge 141 commits into `feature/turboquant-kv-cache`.
Conversation
Adds CUDA dequantization for TQ4_1S (5.0 bpv) and TQ3_1S (4.0 bpv) WHT-rotated weight compression types. These achieve 27-37% model size reduction at +1.0-1.9% PPL on the Qwen/Phi families. Base types + Metal + CPU quantize/dequant come from TheTom's PR TheTom#45.

CUDA additions:
- turbo-quant.cuh: weight centroids (N(0,1) Lloyd-Max, 16/8 levels), sign array for the 32-element inverse WHT
- dequantize.cuh: dequantize_tq4_1s/tq3_1s — full 32-element block inverse RHT (5 butterfly stages + normalize + unsign)
- convert.cu: TQ4_1S/TQ3_1S in all 4 dequant dispatchers
- ggml-cuda.cu: supports_op for MUL_MAT and GET_ROWS; excluded from mmvq/mmq (uses the cuBLAS dequant-to-f16 path)

The cuBLAS path is correct for initial support. Future optimization: pre-rotate activations via a warp shuffle WHT (same pattern as the KV cache Q rotation) to eliminate the per-block inverse WHT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
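For reference, here is a scalar sketch of what the dequant functions compute per 32-element block: 5 butterfly stages, normalize, then undo the random sign flips. This is the textbook iterative form under simplified assumptions (float math, signs as ±1), not the CUDA kernel, which distributes the work across a warp.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Scalar sketch of the 32-element inverse RHT: 5 butterfly stages
// (log2(32) = 5), then normalize by 1/sqrt(32), then undo the random
// sign flips. sign[i] is assumed to be +1 or -1.
static void inverse_rht_32(float x[32], const int8_t sign[32]) {
    for (int stride = 1; stride < 32; stride *= 2) {
        for (int i = 0; i < 32; i++) {
            if ((i & stride) == 0) {
                const float a = x[i], b = x[i + stride];
                x[i]          = a + b;
                x[i + stride] = a - b;
            }
        }
    }
    const float norm = 1.0f / std::sqrt(32.0f);
    for (int i = 0; i < 32; i++) {
        x[i] *= norm * sign[i];   // normalize and unsign
    }
}
```

Because H·H = 32·I and each pass scales by 1/sqrt(32), applying the transform twice (with trivial signs) is the identity, which makes the round-trip easy to sanity-check.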
Regression Test Results — PR #45

Verified that the TQ4_1S weight compression PR does NOT break existing TurboQuant KV cache functionality or standard inference on non-compressed models.

Hardware: M5 Max (128GB) + Mac Mini M2 Pro (32GB)

Speed — No Regressions
All speeds normal or improved. No regressions.

PPL — No Regressions (full wikitext-2 runs)
All PPL values match known-good. MUL_MAT_ID (MoE path) verified working.

Verdict
ALL TESTS PASS. 5 models, 2 hardware platforms, 4 KV configs.
Force-pushed 6c3e503 to cb8bddc.
turbo4 SET_ROWS was using turbo3's shared template with the wrong 2+1-bit packing. New dedicated kernel_set_rows_turbo4 with correct 3-bit packed indices + QJL signs. PPL: 679 → 6.19. Also added turbo4 prefill FA kernel instantiations (non-vec path).

QJL ablation finding: disabling QJL improves PPL from 6.1894 to 6.1756 (identical to turbo3). The QJL correction hurts quality in the attention context. Consistent with scos-lab issue #45.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Update: Rebased on upstream master + regression test

Branch force-pushed. Now rebased on latest
Upstream conflict: activation rotation (commit 744c0c7)

Upstream added graph-level Hadamard rotation for KV cache quantization (
Fix: disabled upstream rotation by default in our fork. Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0.

Regression test (M5 Max, rebased branch cb8bddc)
All tests pass. No regressions. Phi-4 crash resolved.
CUDA port available on our branch: signalnine/llama-cpp-turboquant

What's implemented:
Results (Qwen2.5-7B TQ4_1S, RTX 5090):
The fused kernel pre-rotates the activation vector once per mul_mat via a warp shuffle WHT.

Happy to iterate on this if you have ideas for closing the CUDA gap further.
This is great work, thank you for turning this around so fast. PPL matching between cuBLAS and fused confirms correctness. One question before we merge: can you confirm that uncompressed models (q8_0, q4_0, etc.) show no decode regression on this branch? i.e. the new code paths only activate for TQ4_1S/TQ3_1S, and existing quant types run at the same speed as before the PR.
CUDA kernel review — performance improvement opportunities

Nice work on the V8 pre-rotation approach. PPL matching confirms correctness. Here's what I see for closing the gap from 39% to 70-85% of q8_0:

High priority (biggest decode wins)
Medium priority
Skip / low value
Realistic ceiling

Per architecture with full tuning (NR0 + load dedup + vectorized + batch):
The 39% → 70-85% gap is primarily data reuse, not math precision. The pre-rotation design is correct — it just needs the activation tile shared across more rows per CTA.
Full Regression Test — PR #45 (cb8bddc)

Hardware
Quantize tool verification
M5 Max — Uncompressed weights + TurboQuant KV

Qwen2.5-1.5B Q8_0
Phi-4 14B Q8_0 (crash fix verification)
No crash. Upstream attn_rot disabled by default (commit cb8bddc).

Qwen3.5-27B Q8_0
Qwen3.5-35B MoE Q8_0 (MUL_MAT_ID path)
M5 Max — TQ4_1S Weight Compression

Qwen2.5-1.5B Config I (1.28 GiB, 6.20 BPW)
Mac Mini M2 Pro — Qwen2.5-7B Q4_K_M
Summary
All tests pass. PR is safe for review.
…bug 1

Ports GGML_TYPE_TURBO4_0 to CUDA using the 4-bit PolarQuant format (16 centroids, nibble-packed, no QJL). Previously turbo4 crashed on CUDA with "cannot run the operation (SET_ROWS)".

Changes the TURBO4_USE_4BIT default from Metal-only to all backends. The 4-bit format (16 centroids) has better quality than the legacy 3-bit+QJL format and is simpler to implement (no residual projection).

Full CUDA stack:
- turbo-quant.cuh: 4-bit centroids, midpoints, nearest-centroid, dequant element, per-block quantize
- set-rows.cu: k_set_rows_turbo4 kernel (128 threads, WHT rotation, 4-bit quantize, nibble pack via warp shuffle, corrected norm)
- dequantize.cuh + convert.cu: turbo4 to f16/f32
- fattn-common.cuh: vec_dot_KQ_turbo4 + dequantize_V_turbo4
- fattn-vec.cuh + fattn.cu: VEC dispatch + all cross-type instances (turbo4×turbo4, turbo4×q8_0, turbo4×turbo3, turbo4×turbo2)
- ggml-cpu.c: CPU FA vec_dot for turbo4

PPL (Qwen3.5, wikitext-2): 6.23 (+0.8% vs q8_0) at 3.8× compression
Speed: 217 t/s decode (comparable to turbo3's 222 t/s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
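The nibble packing the commit refers to is two 16-centroid (4-bit) indices per byte. A scalar sketch of the bit layout (the kernel does this warp-cooperatively via shuffles; placing the even element in the low nibble is an assumption here, the actual element order may differ):

```cpp
#include <cassert>
#include <cstdint>

// Two 4-bit centroid indices packed into one byte: lo in bits 0-3,
// hi in bits 4-7. Indices must be in [0, 15].
inline uint8_t pack_nibbles(uint8_t lo, uint8_t hi) {
    return static_cast<uint8_t>((lo & 0x0F) | ((hi & 0x0F) << 4));
}
inline uint8_t unpack_lo(uint8_t b) { return b & 0x0F; }
inline uint8_t unpack_hi(uint8_t b) { return b >> 4;   }
```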
… (issue #28) The block-size divisibility check in llama-context.cpp rejected turbo4 on GLM-4.7 Flash (head_dim=576, QK_TURBO4=128, 576%128≠0) before the KV cache zero-padding code could run. Fix: for turbo types, compute the padded head_dim (ceil to 128) before the divisibility check, matching what llama-kv-cache.cpp actually does. Tested: GLM-4.7 Flash turbo4 loads and runs at 193 t/s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
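The fix above amounts to rounding head_dim up to the next multiple of the turbo block size before the divisibility check. A minimal sketch (QK_TURBO4 = 128 as stated in the commit; the real check lives in llama-context.cpp):

```cpp
#include <cassert>

// Round head_dim up to the next multiple of the turbo4 block size,
// matching the zero-padding llama-kv-cache.cpp applies.
constexpr int QK_TURBO4 = 128;

constexpr int turbo_padded_head_dim(int head_dim) {
    return ((head_dim + QK_TURBO4 - 1) / QK_TURBO4) * QK_TURBO4;
}
```

For GLM-4.7 Flash, head_dim=576 pads to 640, and 640 % 128 == 0 passes the check.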
Add V-only layer-adaptive modes to TURBO_LAYER_ADAPTIVE env var:
- Mode 5: first2+last2 V=turbo4, rest V=turbo2
- Mode 6: last8 V=turbo4, rest V=turbo2
- Mode 7 (recommended): first2+last2 V=q8_0, rest V=turbo2
Mode 7 ("Boundary V") protects quality-sensitive boundary layers with
q8_0-V while aggressively compressing middle layers with turbo2-V.
K cache is unchanged (stays at whatever -ctk specifies).
Validated on Metal (M5 Max) across 4 models, 2 context lengths:
- phi-4-Q8_0: 4.784 PPL (vs turbo2 4.835, turbo3 4.742)
- Qwen2.5-7B Q4_K_M: 6.835 (vs turbo2 6.911, turbo3 6.707)
- Qwen3.5-35B MoE: 5.148 (vs turbo2 5.257, turbo3 5.137)
- Qwen3.5-27B Dense: 6.423 (vs turbo2 6.534, turbo3 6.273)
- 8K context: stable, no collapse
- NIAH retrieval: pass
- Speed: no penalty
Effective V compression between turbo2 and turbo3, closer to turbo2
on deeper models. Quality consistently better than uniform turbo2-V.
Usage: TURBO_LAYER_ADAPTIVE=7 llama-server -m model.gguf -ctk q8_0 -ctv turbo2 -fa 1
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
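A sketch of the mode-7 "Boundary V" selection described above, with hypothetical names (not the fork's actual code): the first 2 and last 2 layers keep a q8_0 V cache, all middle layers get turbo2.

```cpp
#include <cassert>

// Hypothetical stand-ins for the real ggml cache-type enum values.
enum class VType { Q8_0, TURBO2 };

// TURBO_LAYER_ADAPTIVE=7: boundary layers (first 2 + last 2) are
// quality-protected with q8_0-V; middle layers use turbo2-V.
VType boundary_v_type(int layer, int n_layers) {
    const bool boundary = layer < 2 || layer >= n_layers - 2;
    return boundary ? VType::Q8_0 : VType::TURBO2;
}
```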
…w-up)

state_write_data and state_read_data used hparams.n_embd_k_gqa (576) for ggml_row_size, but turbo types zero-pad to 640. For turbo4 (QK=128), 576 % 128 != 0 → ggml_row_size assertion failure during prompt cache save on llama-server slot reuse.

Fix: use k->ne[0] / v->ne[0] (the actual padded tensor width) instead of the hparams values in all four serialization paths (K write, K read, V write, V read).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rnels

Port TheTom's warp-cooperative turbo3 SET_ROWS kernel and turbo2/turbo3 flash attention templates to HIP/ROCm (7900 XTX, gfx1100).

HIP vendor header fixes:
- Add cudaMemcpyToSymbol/FromSymbol -> hipMemcpyToSymbol/FromSymbol
- Add cudaMemcpyHostToDevice/DeviceToHost mappings
- Fix __shfl_sync, __shfl_xor_sync, __shfl_up_sync, __shfl_down_sync to support both 3-arg and 4-arg calls (CUDA allows defaulting width to warpSize; the HIP macros required 4 args)
- Add __ballot_sync -> __ballot with a uint32_t cast (HIP returns 64-bit on wave64 platforms; the turbo code expects 32-bit)

HIP CMakeLists:
- Add turbo3 and turbo2 flash attention template instances (same files as the CUDA CMakeLists; they were missing from the HIP build)

Tested: Mistral-Small-24B turbo3 PPL = 5.28 (+2.4% vs F16 baseline 5.16). Previously showed catastrophic PPL ~15000 due to a CPU quantize stub bug (fixed by TheTom in 53f1298).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
One norm per rotation group instead of four identical copies.
Eliminates 6 bytes of redundant storage per 128-element group.
turbo3: 3.50 -> 3.125 bits/value, 4.57x -> 5.12x compression
turbo2: 2.50 -> 2.125 bits/value, 6.4x -> 7.53x compression
Zero PPL regression validated across:
- Asymmetric q8_0-K + turbo{2,3}-V
- Symmetric turbo3/turbo3
- Boundary V (LA-V7)
- 3 architectures (dense, Qwen, MoE)
- 3 context lengths (512, 8K, 32K)
- 2 Apple Silicon platforms (M5 Max, M2 Pro)
- NIAH 3/3 pass
+3-7% decode on tested M2 Pro setup. No regression on M5.
Also adds derived NL_TURBO3/NL_TURBO2 macros replacing ~250
hardcoded FA template nl values. Block size is now a one-line
edit in ggml-common.h.
Credit to @AmesianX whose block_size=256 CUDA implementation
prompted this investigation.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
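The arithmetic behind the numbers above, assuming each block norm is a 2-byte half float: a 128-element rotation group holds four 32-element blocks, so dropping 3 of the 4 identical norms saves 3 × 2 bytes = 48 bits, i.e. 48 / 128 = 0.375 bits/value.

```cpp
#include <cassert>
#include <cmath>

// Savings from deduplicating the per-group norm (assumes fp16 norms).
constexpr double saved_bpv   = 3 * 2 * 8 / 128.0;   // 0.375 bits/value
constexpr double turbo3_bpv  = 3.50 - saved_bpv;    // 3.125
constexpr double turbo2_bpv  = 2.50 - saved_bpv;    // 2.125
constexpr double turbo3_comp = 16.0 / turbo3_bpv;   // compression vs f16: 5.12x
constexpr double turbo2_comp = 16.0 / turbo2_bpv;   // ~7.53x
```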
The block_size=128 change (adac2c6) broke CUDA quantization: with QK=128, blocks_per_group=1, but the warp-cooperative packing still used blk_base+warp_id, causing warps 1-3 to write OOB. Fix: compute elem_in_block = j % QK_TURBO_N and use it for block pointer (j / QK_TURBO_N) and byte offsets (elem_in_block / 4 for qs, elem_in_block / 8 for signs). Works for both QK=32 and QK=128. Validated on RTX 3090 (sm_86), llama3.1:8b Q4_K_M, q8_0/turbo3: PPL = 7.587 (matches QK=32 baseline exactly).
Sparse V: now enabled by default on all Metal (was M5+ only). Validated across 30+ testers with zero PPL impact. Opt-out: TURBO_SPARSE_V=0.

Boundary V: auto-enabled (mode 7) when -ctv turbo2 is set. Protects the first 2 + last 2 layers with q8_0-V, rest turbo2-V. 37-91% quality recovery across 4 tested models. Opt-out: TURBO_LAYER_ADAPTIVE=0.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The HIP build was missing 9 turbo cross-type flash attention vec instantiations (turbo4 combos, turbo3/turbo2 cross-types) that were present in the CUDA CMakeLists but not mirrored to the HIP CMakeLists. Also guard the D>=576 tile kernel dispatch with #ifndef GGML_USE_HIP since those instance files are already excluded from the HIP build (they exceed HIP's 65536-byte local memory limit). Tested on: ROCm 6.4.4, gfx1151 (AMD Ryzen AI Max+ 395 / Strix Halo)
Leftover from 1-bit VX experiment. Causes -Werror build failure in CI. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add WHT-rotated weight quantization types:
- TQ3_1S (type 44): 3-bit, 8 Lloyd-Max centroids, 4.0 BPW
- TQ4_1S (type 45): 4-bit, 16 Lloyd-Max centroids, 5.0 BPW

Both use a 32-element Randomized Hadamard Transform with dual half-block scales (d0/d1). Quantization: forward RHT → scale search → iterative refinement (6 iter) → pack indices.

Metal optimization (V2.1 fused kernel):
- Zero threadgroup memory for rotation (was 20KB+ on large models)
- Cooperative SIMD rotation via simd_shuffle_xor (registers only)
- Single simd_sum at the end (not per-block)
- NR0=8 rows per threadgroup (amortizes rotation cost)
- Memory barriers between rotate/matmul/unrotate dispatches
- MoE MUL_MAT_ID support with rotated expert dispatch

Config I (recommended): attn+ffn_gate/up=TQ4_1S, ffn_down=Q4_K, boundary 2+2

Validated on Qwen2.5-1.5B, Qwen3.5-27B, Qwen3.5-35B-A3B MoE:
- 27-41% model size reduction
- +1.3-1.9% PPL (Qwen), 94-102% decode speed
- NIAH pass, KLD comparable to turbo3 KV
- Llama 3.1 70B shows +25% PPL — needs investigation

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
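The stated BPW figures follow directly from the 32-element block layout. A quick check, assuming the dual half-block scales d0/d1 are stored as fp16 (16 bits each):

```cpp
#include <cassert>

// Bits per weight for a 32-element block: packed centroid indices
// plus two fp16 half-block scales (d0, d1).
constexpr double tq_bpw(int index_bits) {
    return (32 * index_bits + 2 * 16) / 32.0;
}
// tq_bpw(3) == 4.0 (TQ3_1S), tq_bpw(4) == 5.0 (TQ4_1S)
```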
Without this barrier, the GPU may start executing the next node's matmul while the unrotate kernel is still modifying src1. This causes data corruption when TQ and non-TQ tensors are mixed within the same layer's attention block. The barrier ensures the unrotate completes before any subsequent operation reads from the same activation buffer. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uant) Upstream commit 744c0c7 added graph-level Hadamard rotation for KV cache quantization. This conflicts with our kernel-level WHT rotation and causes graph hash table overflow on Phi-4 and potentially other models. Disable by default since TurboQuant already handles rotation at the kernel level (more efficient, no extra graph nodes). Users can re-enable with LLAMA_ATTN_ROT_DISABLE=0 if needed. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…stration

- ggml-cuda.cu: add TQ4_1S/TQ3_1S exclusion in ggml_cuda_should_fuse_mul_mat_vec_q (was missing, causing an ABORT in mmvq.cu)
- tools/quantize/quantize.cpp: register TQ3_1S/TQ4_1S in the allowed types

Tested: Qwen2.5-7B TQ4_1S — correct output, PPL 8.82 (+1.1% vs Q2_K), 6510 t/s prefill, 20 t/s decode (cuBLAS dequant-to-f16 path).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two-phase approach: pre-rotate the activation once via a warp shuffle WHT, then a simple mmvq kernel reads the pre-rotated values (centroid × scale only, zero WHT per block).

Results (Qwen2.5-7B TQ4_1S, RTX 5090):
- Decode: 20.3 → 69 t/s (3.4x speedup)
- vs q8_0 (177 t/s): 39%
- PPL: 8.82 (identical)

Comprehensive optimization log (13 versions tested):
- cuBLAS baseline: 20 t/s
- V1 per-warp WHT, 4 warps: 60 t/s (3.0x)
- V3 shmem activation cache: 33 t/s (syncthreads kills it)
- V5 multi-warp per row: 62 t/s
- V6 LUT (shmem): 37 t/s (sync overhead)
- V7 8 warps clean: 62 t/s
- V8 pre-rotation (2-phase): 69 t/s ← BEST (3.4x)
- V9 pre-rot + q8_1: 70 t/s (marginal)
- V10 4-elem/thread: 57 t/s
- V13 8-elem, 4-thread dot: 45 t/s
- NR0=2/4/8: all regressed (register spill / cache thrash)

The gap to q4_0 (275 t/s) comes from dp4a integer intrinsics and packed int32 processing — TQ4_1S requires a float centroid lookup, which can't use dp4a. The gap to TheTom's Metal (85-99% of q8_0) comes from Apple Silicon's cooperative SIMD efficiency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
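An illustrative model of the V8 phase-2 inner loop (names and layout are hypothetical, not the kernel's structs): once the activation is pre-rotated, each element costs only a centroid table lookup and one FMA, with no inverse WHT per block.

```cpp
#include <cassert>
#include <cstdint>

// Phase 2 of the two-phase scheme: act_rot was already WHT-rotated in
// phase 1, so the dot product is just lookup + multiply + scale.
float dot_prerotated(const float* act_rot, const uint8_t* idx,
                     const float* centroids, float scale, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += act_rot[i] * centroids[idx[i]];  // no per-block WHT here
    }
    return acc * scale;
}
```

The float centroid lookup in this loop is also why dp4a cannot be used: dp4a consumes packed int8 operands, not indexed float codebook entries.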
MoE models use MUL_MAT_ID, which calls cudaStreamSynchronize in the expert routing dispatch. This is incompatible with CUDA graph capture. TQ4_1S types now disable CUDA graphs for MUL_MAT_ID nodes, matching the existing behavior for non-quantized types.

Also: use a persistent buffer for the activation pre-rotation scratch, with a graph-capture-safe check.

Tested: Qwen3.5-35B TQ4_1S — 47 t/s decode, PPL 6.42.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ratch

Replace the two-phase approach (separate pre-rotation kernel + global memory round-trip) with a single-phase kernel where all 8 warps cooperatively WHT-rotate the activation into shared memory, then each warp processes one row reading from shmem (broadcast reads from L1).

Key changes:
- No global scratch buffer (eliminates the CUDA graph incompatibility)
- No separate kernel launch for pre-rotation
- Activation stays in shmem (~14-32 KB) instead of global memory
- Single __syncthreads between rotation and dot product (NOT in the inner loop)
- V8 two-phase fallback retained for ncols > 12288 (48 KB shmem limit)

This avoids the NR0 regression that killed V3/V6/V11 — those had a sync inside the dot product loop. V12's sync is between the two phases.

Expected: 30-50% decode improvement on Ampere+ (shmem broadcast eliminates 2x activation bandwidth). Pascal improvement smaller (still bandwidth bound).

NEEDS TESTING — apply and benchmark on Ampere/Ada/Blackwell.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add _USE_MATH_DEFINES + an M_PI fallback for MSVC (which doesn't define M_PI)
- Add GGML_API to turbo3_cpu_wht_group_size for DLL export
- Move the extern declaration to file scope with extern "C" GGML_API linkage to fix C vs C++ name mangling across the DLL boundary

All changes are no-ops on Linux/Mac. Fixes MSVC build errors:
- C2065: 'M_PI': undeclared identifier
- LNK2001: unresolved external symbol turbo3_cpu_wht_group_size

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add missing CUDA->HIP stream capture API mappings (vendors/hip.h)
- Add a TURBO_IQ_API macro for cross-DLL symbol visibility (Windows + Linux)
- Add fileno/isatty POSIX compat macros for clang on Windows

No kernel changes needed. signalnine's fused mmvq-tq kernel uses __shfl_xor_sync, which maps directly to the HIP warp shuffle on RDNA 4.

Tested: RX 9070 XT (gfx1201, RDNA 4), Qwen2.5-1.5B Config I. Result: 30% faster decode than Q8_0 (135 vs 104 t/s), +1.8% PPL.

Metal regression: clean, no changes to non-Windows/non-HIP paths.

Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tested V12 on RTX 5090 (Blackwell, sm_120). Build aa7c82a, Qwen2.5-7B-Instruct quantized from f16.
V12 is within noise of V8 on this card — 66.8 vs 69.2 t/s. The shmem broadcast doesn't help on Blackwell, probably because the 128 MB L2 on the 5090 was already caching the global scratch buffer effectively. The global memory round-trip that V12 eliminates wasn't a bottleneck here.

Interesting that Mario's dual 4090 gets 70.6% of Q8_0 while the 5090 gets 38.5%. The 4090 has ~1 TB/s of bandwidth vs the 5090's ~1.8 TB/s, so the 4090 is more compute-bound relative to bandwidth — the centroid lookup cost is a smaller fraction of total time. On the 5090 the kernel is solidly bandwidth-bound, but the float activation + float FMA path can't saturate the available bandwidth the way dp4a can.

I've been doing extensive kernel optimization on our branch — tested V8 through V19 plus a TQ4_0 prototype (WHT + uniform q4_0 format with native dp4a). Full optimization log here: https://github.com/signalnine/llama-cpp-turboquant/blob/feature/tq4-weight-cuda/docs/tq4-weight-cuda-optimization-log.md

Key finding: the centroid lookup itself is NOT the bottleneck (confirmed via ablation — replacing
Force-pushed 1ca65b5 to 3ae5a42.
Bug: Gemma 4 (head_dim=256) produces garbage on current TOT

Reproduced on Metal (M5 Max). Gemma 4 31B with asymmetric KV (q8_0/turbo) produces garbled multilingual output. This matches the same class of corruption reported in #47 on CUDA.

Not CUDA-specific. Not multi-GPU-specific. It's a head_dim=256 + asymmetric code path bug that affects both Metal and CUDA. Investigating and fixing now. Will be a separate commit on this PR.
HIP/ROCm Test Results — AMD RX 6600 (gfx1032), ROCm 6.4

Hardware: AMD RX 6600 (gfx1032, 8GB VRAM)

Build

✅ HIP build succeeds with one fix required:
Fix: remove the trailing `extern`:

```diff
-# define GGML_API __attribute__ ((visibility ("default"))) extern
+# define GGML_API __attribute__ ((visibility ("default")))
```

After this patch, the build succeeds.

Quantization

✅ TQ4_1S compression works correctly:
Quantize time: 15 seconds on CPU.

Inference — GPU (HIP)

❌ Aborts in rocBLAS: gfx1032 isn't in the rocBLAS TensileLibrary — only gfx1030 ships. Workaround: HSA_OVERRIDE_GFX_VERSION=10.3.0.

Affects both Q8_0 and TQ4_1S (so not TQ4-specific — base HIP matmul is broken for this arch in the current build).

Inference — CPU (TQ4_1S)

❌ GGML_ASSERT failure — CPU dequant not implemented

CPU inference with the Q8_0 baseline works fine (266 t/s pp512, 22.5 t/s tg128), confirming the crash is TQ4_1S-specific — the CPU dequantize/matmul path for the new quant type appears unimplemented.

Summary
Happy to test again once the CPU backend and gfx1032 matmul issues are addressed.
…port) Gemma 4 uses global_head_dim=512 for full attention layers. The turbo FA kernels were only instantiated up to dk256 for symmetric and cross-turbo combos. Missing dk512_dv512 caused pipeline compilation failure on Gemma 4 (and any future model with head_dim=512 + turbo KV). Added 18 template instantiations (9 non-vec + 9 vec) for all turbo type combinations at dk512_dv512. Asymmetric q8_0/turbo combos already had dk512 and were not affected. Tested: Gemma 4 31B on M5 Max, symmetric turbo3/turbo3 and asymmetric q8_0/turbo4 both produce correct bench results at dk512. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Stack-allocated float tmp[4096] buffers in CPU vec_dot functions crashed on models with intermediate_size > 4096 (e.g. TinyLlama 5632, Qwen 27B 18944). Replaced with heap allocation. Affects CPU-only inference fallback path. GPU users unaffected. Reported by @oemc1470 on RX 6600 (gfx1032) where broken HIP forced CPU fallback. Tested: Qwen3.5-27B Config I, CPU-only (-ngl 0), intermediate_size=18944. No crash, no assert. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
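The shape of the fix above, as a sketch (illustrative; the real vec_dot signature differs): size the scratch to the actual row length instead of a fixed stack array, so intermediate_size > 4096 no longer overflows.

```cpp
#include <cstddef>
#include <vector>

// Scratch sized to n on the heap — was: float tmp[4096]; on the stack,
// which overflowed for intermediate_size 5632 (TinyLlama) / 18944 (Qwen 27B).
std::size_t vec_dot_scratch_len(int n) {
    std::vector<float> tmp(static_cast<std::size_t>(n));
    // ... dequantize one row into tmp, then accumulate the dot product ...
    return tmp.size();
}
```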
Excited to see TQ4_1S weight compression land on CUDA. Happy to benchmark on dual L40S once the CUDA kernel is ready for Ada testing. Interested in both the weight compression decode performance and the combined TQ4_1S weights + turbo KV cache stacking on this architecture. Just let me know. |
Thank you @joemc1470 for the thorough report. Great debugging.

CPU assert (n <= 4096): Fixed in commit 21110eb. The CPU vec_dot functions had stack buffers sized for head_dim (128), but your model hits intermediate_size (5632) on the CPU fallback path. Replaced with heap allocation. Please pull TOT and retest.

gfx1032 HIP matmul abort: This is a known upstream issue. The RX 6600 is not in the rocBLAS TensileLibrary (only gfx1030 ships). HSA_OVERRIDE_GFX_VERSION=10.3.0 is the documented workaround, but it has known caveats with MMQ kernels. Multiple upstream issues filed (ggml-org#15244, ggml-org#15202, ggml-org#21106). Not something we can fix on our side.

GGML_API extern on GCC 13.3: This is in upstream ggml.h, not our code. The trailing `extern` …
It is ready to test now. Pull TOT from the PR branch and follow the quickstart.

For your 122B you will need a custom layer count (48 layers, but only 12 are attention). For a clean first test I would recommend Qwen3.5-27B Q8_0 (64 layers, fits comfortably on dual L40S). The quickstart has copy-paste instructions. Build from …

Stacking with turbo KV works. Mario (on X) just confirmed on dual 4090: Config I + KV turbo4, no additional penalty from stacking.

Your Ada L40S numbers would be very valuable. First datacenter GPU weight compression data. Also it would be great to confirm no decode regression on uncompressed models (Q8_0 baseline with and without the PR branch). signalnine has found some decode issues we need to look into.
Summary
simd_shuffle_xor, NR0=8
llama-quantize --allow-requantize --tensor-type-file config.txt

Tested Models
Llama Note
Llama-family models show 6-8x higher per-layer error amplification with WHT-rotated FFN tensors. Use Hybrid (TQ4 attn + Q4_K FFN) or Premium (TQ4 attn + Q5_K/Q6_K FFN) configs. Both beat Q4_K_M in quality and speed at similar size. Full investigation in the paper.
What's needed before merge
Metal only

The quantization step (llama-quantize) works on any platform. The runtime dequant kernels are Metal-specific. Compressed GGUFs will not run correctly on CUDA/HIP until those backends are ported.
Getting started: https://github.com/TheTom/turboquant_plus/blob/main/docs/getting-started.md
🤖 Generated with Claude Code