feat: CUDA port of TurboQuant3 KV cache — 3.47x compression, 98.5% of F16 decode speed on RTX 5090#3
Conversation
Ports the Metal turbo3 implementation to NVIDIA CUDA. End-to-end working
on RTX 5090 with Qwen3.5 35B A3B Q4_K_M: 3.47x KV compression at 32K
context, ~4x max context extension (256K → 1M tokens on 32GB VRAM).
New files:
- ggml-cuda/turbo-quant.cuh — block_turbo3_0 layout, WHT sign arrays,
3-bit centroid LUT, dequant helpers,
quantize kernel (set_rows path)
- ggml-cuda/turbo-wht.cu/.cuh — GGML_OP_TURBO_WHT CUDA kernel; 128-thread
blocks, in-place butterfly in shared memory,
forward + inverse WHT via compile-time template
- ggml-cuda/template-instances/fattn-vec-instance-turbo3_0-turbo3_0.cu
— VEC flash-attention instance for D=64/128/256
Modified files:
- dequantize.cuh — dequantize_turbo3_0 (produces float2 pairs)
- convert.cu — all 5 to-fp16/fp32 dispatchers
- fattn-common.cuh — vec_dot_fattn_vec_KQ_turbo3_0, dequantize_V_turbo3_0,
dispatcher extensions
- fattn-vec.cuh — turbo3 treated as unquantized (f16-style nthreads_KQ)
- fattn.cu — route turbo3 exclusively to VEC kernel; add dispatch macro
- set-rows.cu — k_set_rows_turbo3 kernel: per-128-elem group quantization
with WHT rotation and norm correction
- ggml-cuda.cu — supports_op + compute dispatch for TURBO_WHT + SET_ROWS
- llama-kv-cache.cpp — +2 tensor overhead for rotation matrices
Benchmark (RTX 5090, Qwen3.5 35B A3B Q4_K_M, FA on):
KV memory @32k: 702 MiB (f16) → 202 MiB (turbo3) = 3.47x compression
Max context: ~256K (f16) → ~1M (turbo3) = ~4x extension
Decode @short: 233 t/s (q8_0) → 190 t/s (turbo3) = 0.82x
Prefill @32k: 6335 t/s (q8_0) → 1215 t/s (turbo3) = 0.19x
Note: prefill speed degrades significantly vs Metal (Metal: 0.99x q8_0 at all
contexts; CUDA: 0.19x at 32K). Root cause: turbo3 currently uses the VEC
flash-attention kernel; q8_0 uses the more efficient TILE/MMA kernels at long
context. TILE/MMA support for turbo3 is the next milestone.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the 5-line early-return that forced turbo3 onto the VEC flash attention kernel. The VEC kernel is still used for decode (Q->ne[1]==1) via the existing dispatch logic, but prefill now goes through the Turing MMA kernel (RTX 5090 is SM 12.0 >> 7.5). launch_fattn already pre-dequantizes K/V to FP16 when need_f16_K/V are set (which TILE/MMA always pass as true). Our ggml_get_to_fp16_cuda and ggml_get_to_fp16_nc_cuda dispatchers for TURBO3_0 — added in the original CUDA port commit — provide that conversion automatically. Stride recalculation (nb11 = nb11*bs*sizeof(half)/ts) also works out correctly for turbo3 (bs=32, ts=14): nb11*32*2/14 = ne[0]*sizeof(half). ✓ Before (VEC only): After (MMA for prefill): 2K prefill: 5032 t/s (0.73× q8_0) 6734 t/s (0.98× q8_0) 8K prefill: 3110 t/s (0.46× q8_0) 6613 t/s (0.98× q8_0) 32K prefill: 1215 t/s (0.19× q8_0) 6168 t/s (0.97× q8_0) Matches Metal M5 Max result (0.99× q8_0 flat across all context sizes). Decode unchanged (VEC, ~0.64-0.82× q8_0 depending on context depth). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e (71→94 t/s, 98.5% of F16)
k_set_rows_turbo3 was the decode bottleneck: 1 thread/group serial kernel
gave 3.1% GPU utilisation (36.5 µs × 80 calls/token = ~21% of decode budget).
Replace with a fully parallel kernel — 1 block per 128-element group,
128 threads per block (one thread per element):
• Shared-memory WHT butterfly (7 stages, no atomics)
• Warp-reduce L2 norm + inter-warp accumulate via smem
• qs packed with __shfl_sync (4-wide gather), signs with __ballot_sync
• Reconstruction norm same pattern; one write per sub-block (warp lane 0)
Also tighten flash-attention dequant paths (fattn-common.cuh):
• vec_dot_fattn_vec_KQ_turbo3_0: elem0/elem1 always share the same
turbo3 block — load qs/signs once instead of twice per pair
• dequantize_V_turbo3_0: ne==4 fast path — load one qs byte and one
signs byte for all 4 elements; unrolled float2 / half2 output pairs
Benchmark (Qwen3.5 35B, RTX 5090, tg128):
Before: 71.86 t/s (0.75× q8_0)
After: 94.04 t/s (0.985× q8_0, within measurement noise of parity)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed TheTom#3 (TURBO_D). TheTom#1 and TheTom#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) #4 Deferred norm: 12.9 (loses ILP) #5 2-pair half2: 12.0 (ternary overhead) #6 Select chain: 11.9 (branches kill) #7 Bit-arithmetic: 11.6 (ALU too heavy) #8 FMA branchless: 11.4 (ALU still too heavy) #9 Named-reg ternary: 10.3 (branches worst) #10 Main (8-LUT): 10.95 (baseline) #11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
|
Looks like I've got some quality issues with larger context windows, converting to draft while I work those out. |
The outer loop in vec_dot_fattn_vec_KQ_turbo3_0 stepped k_KQ_0 by `nthreads` (8), but the Q register file is loaded in blocks of `nthreads*cpy_ne` (32) elements per thread — the same pattern used by the f16/bf16 VEC kernels. This caused thread t>0 to pair K element (16s + 2t) against Q element (64*(s/4) + 8t + 2*(s%4)), a complete index mismatch. Every generated token had garbage attention scores. Fix: match the f16 kernel pattern — step by nthreads*cpy_ne, add an inner k_KQ_1 loop over cpy_ne pairs, and index Q_v as Q_v[k_KQ_0/nthreads + k_KQ_1]. Also clean up stale "PPL 23.5 vs 6.19" TODO comments in llama-graph.cpp that documented the symptom of this bug. Tested on RTX 5090, Qwen3.5-35B-A3B-Q4_K_M: - PPL (wikitext-2): 6.2023 → 6.2996 (+1.57%, within 5% target) - NIAH: 11/11 at 4K–256K (matches q8_0; was 0/11 before fix) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bug fix: VEC flash-attention Q/K stride mismatch (commit 4c91451)Root cause
This caused thread This kernel is used for all generation steps (n_tokens ≤ 2), so every generated token had garbage attention scores. Prefill (MMA kernel path) was unaffected. Fix (3 lines changed)// Before — wrong stride, Q/K indices misaligned for thread t > 0:
for (int k_KQ_0 = 0; k_KQ_0 < D/2; k_KQ_0 += nthreads) {
const int k_KQ = k_KQ_0 + (threadIdx.x % nthreads);
...
const float2 qv = ((const float2 *) Q_v)[k_KQ_0/nthreads];
// After — matches f16 kernel pattern:
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
for (int k_KQ_0 = 0; k_KQ_0 < D/2; k_KQ_0 += nthreads*cpy_ne) {
for (int k_KQ_1 = 0; k_KQ_1 < cpy_ne; ++k_KQ_1) {
const int k_KQ = k_KQ_0 + (threadIdx.x % nthreads)*cpy_ne + k_KQ_1;
...
const float2 qv = ((const float2 *) Q_v)[k_KQ_0/nthreads + k_KQ_1];Test results (RTX 5090, Qwen3.5-35B-A3B-Q4_K_M)Perplexity (wikitext-2, 512 ctx):
NIAH single-needle (11 depth points, q8_0 vs turbo3):
Aggregate score is identical (turbo3 = q8_0). The few single-cell differences are in opposite directions and consistent with model-level retrieval variance, not KV compression degradation. Speed (llama-bench):
Prefill is at parity. Generation is ~5% slower due to the centroid lookup overhead — this model is compute-bound during decoding (Q4_K_M MoE weights dominate bandwidth), so the 3.47x smaller KV cache doesn't help here. On a weight-fp16 or large-batch workload the bandwidth savings would show. |
|
Thats interesting. I will give it a try. I gave the fork from Madreag/turbo3-cuda a run tonight and it had also issues at large context sizes. q8 Madreag/turbo3-cuda turboquant signalnine/llama-cpp-turboquant (this PR) TG Performance is better on large kv-caches, but still not on par with q8 quants |
Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed spiritbuun#3 (TURBO_D). spiritbuun#1 and spiritbuun#2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Optimizations from RotorQuant PR analysis: - TheTom#2: Flat no-WHT dequant kernel (256 threads, 4 elems/thread, no shmem, no syncthreads) — 32x fewer kernel launches than per-block version - TheTom#3: Shared memory centroid LUT in flat dequant kernel - Fused V*attn CUDA kernel (ready for future custom op integration) - Fixed ggml_turbo_wht assert: ne[0] % 32 (was 128) for WHT32 blocks Attempted direct MMVQ for V (no dequant) — incorrect because WHT rotation on attention weights corrupts the V dot product. WHT orthogonality only holds when both sides use matching block structure. Reverted to working WHT linearity approach (cast + post-matmul inv WHT). Final benchmarks on Qwen3.5-35B-A3B (RTX 5090): Prefill Decode vs f16 f16 baseline: 6860 187 100% turbo3 FA: 6832 188 100% ← parity! turbo3 MMVQ: ~60 139 74% Compression: 4.6x KV cache Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@signalnine , would you be so kind and take a look at #13 ? |
TheTom#13) Models with n_embd_head_k not divisible by QK_TURBO3_GROUP (128) — e.g. GLM-4.7 Flash / DeepSeek2 MLA with head_dim=576 — previously hard-crashed with GGML_ASSERT(ne00 % QK_TURBO3_GROUP == 0) in set-rows.cu. Fix: - supports_op for SET_ROWS: return false when turbo3 and ne00 % 128 != 0, so llama.cpp falls back to an unquantised KV path instead of asserting - get_best_fattn_kernel: return BEST_FATTN_KERNEL_NONE for turbo3 when K head dim is not 128-aligned (VEC kernel only instantiated for D∈{64,128,256}) Affected models can now load with -ctk turbo3 -ctv turbo3; the non-aligned heads silently use f16 KV while 128-aligned heads continue to use turbo3. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…heTom#13) Models with n_embd_head_k not divisible by 128 (e.g. GLM-4.7 Flash, DeepSeek2 MLA with head_dim=192/576) previously crashed with GGML_ASSERT in set-rows.cu or segfaulted in CPU flash attention. Changes: - llama-kv-cache.cpp: detect incompatible head_dim at KV cache init, log a warning, and auto-fall back to q8_0 instead of crashing - ggml-cuda.cu supports_op: return false for turbo3 SET_ROWS when ne00 % 128 != 0, and for TURBO_WHT when ne[0] % 128 != 0 - fattn.cu: return BEST_FATTN_KERNEL_NONE for turbo3 at non-128 D - ggml-cpu.c: add turbo3 entry to type_traits_cpu (vec_dot + from_float) so CPU flash attention can handle turbo3 K/V for models where CUDA FA returns NONE (e.g. D=192 which has no CUDA FA kernel for any type) - set-rows.cu: add tail kernel for non-128 remainder (future use) - turbo-wht.cu: head-dim-aware processing with tail pass-through - ops.cpp: CPU WHT head-dim-aware with tail identity copy Tested: DeepSeek-Coder-V2 (head_dim=192) now loads with -ctk turbo3 and auto-falls back to q8_0 with clear warning. Qwen3.5 (head_dim=128) continues to use turbo3 at full speed (223 t/s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Models with head_dim divisible by 64 but not 128 (e.g. DeepSeek2 MLA
head_dim=192, GLM-4.7 Flash head_dim=576) now use 64-element WHT
groups instead of crashing or falling back to q8_0.
Key changes:
**64-element WHT groups:**
- turbo-quant.cuh: 64-element sign arrays + FWHT-64 + rotate_64
- set-rows.cu: templated on GROUP_SIZE {128,64}, reads group_size
from SET_ROWS op_params (set by llama-kv-cache based on head_dim)
- turbo-wht.cu: templated on group_size, head-dim-aware dispatch
- ggml.h/ggml.c: ggml_turbo_wht() now takes explicit group_size
parameter (0=auto) to handle MLA where output dim differs from
K head dim
- ops.cpp: CPU WHT parameterized on group_size
**MLA fix — missing Q rotation in K-only build_attn:**
- build_attn(inp_attn_k, ...) had no turbo Q pre-rotation, while
SET_ROWS was applying WHT to K. Result: <unrotated_Q, rotated_K>
= garbage. Now all three build_attn variants apply Q rotation.
- Inverse WHT moved inside build_attn_mha (before v_mla projection)
and also added to the non-FA attention path.
**Guards + fallback:**
- supports_op, fattn.cu, llama-graph.cpp: relaxed from %128 to %64
- llama-kv-cache.cpp: falls back to q8_0 only when head_dim%64!=0
- CPU type_traits_cpu turbo3 entry for CPU FA vec_dot
Tested: GLM-4.7 Flash (head_dim=576) 208 t/s, DeepSeek-V2 (192) 143 t/s,
Qwen3.5 (128) 223 t/s — all producing correct output with turbo3.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e versa) Users can now mix turbo3 and q8_0 independently for K and V caches: -ctk turbo3 -ctv q8_0 # compress K more, keep V at higher quality -ctk q8_0 -ctv turbo3 # opposite tradeoff Changes: - New VEC FA template instances for turbo3/q8_0 cross-type pairs (fattn-vec-instance-turbo3_0-q8_0.cu, fattn-vec-instance-q8_0-turbo3_0.cu) - fattn-vec.cuh: extern declarations for mixed instances - fattn.cu: dispatch + allow turbo3/q8_0 mix through the K!=V type guard - llama-graph.cpp: inverse WHT guard now checks V type (not K type) — when K=turbo3 but V=q8_0, V values are NOT WHT-rotated so inverse WHT must not fire. For MLA, V is a view of K so v->type correctly reflects K's turbo type. Tested: - 16/16 coherence tests pass (4 models × 4 KV combos) - NIAH at 4K+32K: all combos within baseline variance (10-11/11) q8_0/q8_0: 10/11, q8_0/turbo3: 10/11, turbo3/turbo3: 10/11, turbo3/q8_0: 9/11 (1 extra miss, normal variance) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
quantize_row_turbo3_0_ref was a stub that only stored the L2 norm and zeroed out qs/signs — any CPU-only path (-ngl 0) silently produced garbage output. Now implements full PolarQuant: L2 norm → normalize → forward WHT rotation (signs1 → butterfly → signs2) → 3-bit centroid quantize → pack qs/signs → corrected norm (grp_norm / recon_norm). WHT group size (128 or 64) is communicated from the CPU SET_ROWS handler via a global variable, read from the op_params that llama-kv-cache.cpp sets based on head_dim. This handles models with different K and V head dims (e.g. DeepSeek2 K=192→64, V=128→128). Tested CPU-only (-ngl 0): - Mixtral 8x7B (head_dim=128): correct output at 4.5 t/s - DeepSeek-Coder-V2 (head_dim=192): correct output at 18.2 t/s - GPU paths unchanged (Qwen 203 t/s, DeepSeek 164 t/s, GLM 202 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds turbo2: 2-bit PolarQuant with WHT rotation, 2.5 bits/value, 6.4x compression vs f16. Same architecture as turbo3 but with a 4-centroid Lloyd-Max codebook instead of 8. Block format: block_turbo2_0 = 10 bytes per 32 values - norm (fp16, 2B): corrected L2 norm (grp_norm / recon_norm) - qs (8B): 2-bit centroid indices, 4 per byte Usage: -ctk turbo2 -ctv turbo2 Full stack: - ggml.h/ggml-common.h: GGML_TYPE_TURBO2_0 enum + block_turbo2_0 struct - ggml-turbo-quant.c: CPU quantize (with WHT) + dequantize - turbo-quant.cuh: CUDA centroids, midpoints, nearest-centroid, dequant - set-rows.cu: k_set_rows_turbo2 kernel (GROUP_SIZE-templated, parallel WHT) - dequantize.cuh + convert.cu: turbo2 to f16/f32 conversion - fattn-common.cuh: vec_dot_KQ_turbo2 + dequantize_V_turbo2 - fattn-vec.cuh + fattn.cu: VEC kernel dispatch + template instances - Mixed types: turbo2/q8_0 cross-type FA instances - common/arg.cpp: CLI --cache-type-k turbo2 - llama-graph.cpp + llama-kv-cache.cpp: graph + cache integration NIAH (Qwen3.5 35B, RTX 5090, 4K-512K): 4K-16K: 11/11 | 32K: 9/11 | 64K: 10/11 | 128K-512K: 11/11 Coherence: 31/31 across 4 models x all KV combos (GPU + CPU) Speed: 228 t/s decode on Qwen3.5 (comparable to turbo3 223 t/s) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The mixed KV types commit changed the inverse WHT guard from checking k->type to v->type. For MLA, V is a view of K's first 512 elements (v->ne[0]=512, 512%128=0 → group=128). But K was quantized with 64-group WHT (k->ne[0]=576, 576%128≠0 → group=64). The mismatch produced garbage output on GLM-4.7 Flash. Fix: derive group_size from K when K is turbo (since K determines the rotation used during quantization), from V only when K is not turbo. Tested: 16/16 coherence (4 models × 4 KV combos), including GLM-4.7. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
InnerQ equalizes K channel variances before WHT rotation to reduce quantization error on models with anisotropic K distributions. Enabled via TURBO_INNERQ=N env var (N = calibration token count). Pipeline: calibrate per-channel K² stats → compute scales → apply scale to K before WHT (SET_ROWS) → apply 1/scale to Q and V output via ggml_turbo_wht op (scale passed as src[1]). Math: <Q/s, s*K> = <Q, K> preserves dot products. Key design: - d_innerq_active starts at 0 (CUDA zero-init) — kernel never multiplies by uninitialized scales - Auto-disables for 64-group models (group_size < 128) where channel mapping is not 1:1 across WHT groups - Auto-disables when max channel ratio < 1.2 (already balanced) - Cross-TU coordination via turbo-innerq.cu/cuh for scale_inv tensor updates between SET_ROWS and graph-side WHT - Scale_inv tensor stored in KV cache, initialized to identity Also: turbo2 (2-bit) now requires head_dim divisible by 128. At 64-group WHT, the 4-centroid codebook doesn't have enough resolution — DeepSeek (head_dim=192) produced garbage. turbo2 auto-falls back to q8_0 for non-128-aligned heads with a warning. Tested: 18/18 coherence (4 models × 4+ KV combos), InnerQ active on Qwen (222 t/s) and Mixtral (146 t/s), auto-disabled on GLM/DeepSeek. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In the VEC flash-attention kernel, skip V dequantization for KV positions where ALL attention weights are below 1e-6. At long context most positions contribute noise, not signal — skipping their V dequant saves compute with zero quality impact. Both half2 and f32 paths are covered. The threshold (1e-6) matches the existing SOFTMAX_FTZ_THRESHOLD behavior where exp() values below this are flushed to zero. Benefit scales with context length — at short context nearly every position contributes so no skip occurs. At 512K+ most positions are skipped during generation. Tested: 10/11 NIAH at 512K (matches pre-sparse-V), 3/3 coherence. Generation at 512K: 25 t/s, prefill: 2535 t/s. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PPL benchmarks revealed that 64-group WHT (for non-128-aligned heads) causes catastrophic quality loss on some models: - DeepSeek (head_dim=192): PPL 344,304 vs 9.9 baseline - GLM-4.7 (head_dim=576): PPL 22.79 vs 14.97 baseline (+52%) The 64-group WHT passed NIAH/coherence but the weaker decorrelation (6 stages, 3 groups per 192-dim head) doesn't preserve statistical quality. Straight PolarQuant (no WHT) is even worse. All turbo types now require head_dim divisible by 128. Models with non-128-aligned heads (DeepSeek2, GLM-4.7 Flash) auto-fall back to q8_0 with a warning. The 64-group WHT code remains for future investigation but is not used. PPL summary (Qwen3.5, head_dim=128, wikitext-2): f16: 6.20 | q8_0: 6.18 | turbo3: 6.31 (+2.2%) | turbo2: 6.69 (+8.3%) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Testing in progress |
Metal Regression Test ReportBranch: PPL — Zero RegressionAll 9 PPL values match pre-merge baselines to 4 decimal places.
NIAH (8K, 3 depth positions)
MoE 2/3 on q8_0 and turbo4 is a known baseline variance at this context length, not a regression. Decode Speed (tg128)
Codex Code Review (gpt-5.3-codex)Reviewed all shared-risk files (
Merge Resolution Details
VerdictNo Metal regressions. All PPL, NIAH, and decode results match or exceed pre-merge baselines across MoE, Dense, and ISWA architectures. Safe to merge. Tested via |
…back)
Instead of falling back to q8_0 for models with non-128-aligned heads,
pad each head to the next multiple of 128 before WHT rotation. The
padded elements are zero and don't affect dot products since WHT
preserves inner products: <WHT(Q_pad), WHT(K_pad)> = <Q, K>.
This keeps turbo compression active on DeepSeek2, GLM-4.7 Flash, and
other non-128 head_dim models. Padding overhead is wasted bits on the
zero-padded quantized elements:
head_dim=192 → 256 (33% overhead, still 3.5x compression)
head_dim=576 → 640 (11% overhead, still 4.1x compression)
Changes:
- llama-kv-cache.cpp: allocate K/V cache at padded dimension, pad
k_cur/v_cur per-head via ggml_pad before set_rows, return padded
views from get_k/get_v
- llama-graph.cpp: pad Q per-head before turbo_wht, extract original
V head_dim from attention output after inverse WHT. All three
build_attn variants (KV, K-only/MLA, ISWA) updated.
PPL (wikitext-2, 512 context, 8 chunks):
DeepSeek (192→256): 11.61 vs 9.90 baseline (+17%)
Previously: 344,304 with 64-group WHT (catastrophic)
GLM-4.7 (576→640): 9.11 vs 14.97 baseline
Qwen3.5 (128): 6.31 unchanged
Tested: 16/16 coherence, NIAH 9-10/11 at 32K, CPU-only works.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolved conflict in ggml-turbo-quant.c (kept both 4-bit centroids and CPU WHT). Updated ISWA build_attn to use new ggml_turbo_wht 5-arg signature. Removed redundant V inverse WHT from ISWA overload (now handled in build_attn_mha). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
|
Merged manually via command line — conflict in |
|
To clarify the merge process: all 14 of @signalnine's commits are in We merged locally because GitHub couldn't auto-merge due to conflicts in
The PR shows as "Closed" rather than "Merged" because we pushed the merge commit directly to the base branch. |
|
Thank you for the contributions! |
|
Awesome. Thanks for your hard work. |
Co-Authored-By: Will Hampson <Whamp@users.noreply.github.com>
…urbo4) Per TheTom's correction: - Norm correction: shared effort (spiritbuun=turbo4, TheTom=turbo3) - turbo4 resurrection + asymmetric K/V: TheTom's work - CUDA port: signalnine (PR TheTom#3), not spiritbuun - Block-128 ≠ AmesianX block-256 (storage vs rotation group)
Full attribution for all contributors: - Madreag: 10 CUDA kernel optimizations, 36 K×V combos, 15 LA modes, 3-GPU validation (1,351+ iterations), NIAH quality testing - TheTom: Metal impl, turbo4 resurrection, asymmetric K/V, turbo3 norm correction, block-128 research, sparse V concept - signalnine: original CUDA port (PR TheTom#3), InnerQ equalization - spiritbuun: turbo4 norm correction, inverse FWHT prefill - HyperionMS2040: block-128 SET_ROWS fix (7cb6edb)
Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) #4 Deferred norm: 12.9 (loses ILP) #5 2-pair half2: 12.0 (ternary overhead) #6 Select chain: 11.9 (branches kill) #7 Bit-arithmetic: 11.6 (ALU too heavy) #8 FMA branchless: 11.4 (ALU still too heavy) #9 Named-reg ternary: 10.3 (branches worst) #10 Main (8-LUT): 10.95 (baseline) #11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
Codex post-commit review found: 1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes 2. SET_ROWS kernel turbo3-specific but instantiated for turbo4 3. Tail block drop for non-128 head dims Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path. Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Complete experiment log: #1 4-mag LUT: 15.1 at 8K (BEST, +38%) #2 Batched extract: 13.7 (+25%) #3 Inline FA block: 13.5 (I-cache pressure) #4 Deferred norm: 12.9 (loses ILP) #5 2-pair half2: 12.0 (ternary overhead) #6 Select chain: 11.9 (branches kill) #7 Bit-arithmetic: 11.6 (ALU too heavy) #8 FMA branchless: 11.4 (ALU still too heavy) #9 Named-reg ternary: 10.3 (branches worst) #10 Main (8-LUT): 10.95 (baseline) #11 Non-vec FA: 10.2 (wrong kernel) Ceiling: 24.5 (no dequant) Apple8 hardware truth: 1 divergent constant read < 7 ALU ops (even with fma) Branches cost MORE than divergent constant reads Array indexing ALWAYS spills on Metal 4 constant addresses is the sweet spot The 4-mag LUT is the dequant-level ceiling on Apple Silicon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
Summary
This PR ports TurboQuant3 (turbo3) KV cache compression to CUDA, targeting SM 12.0 (RTX 5090 / Blackwell) with near-parity decode performance vs F16.
What's included
CUDA kernel port (
set-rows.cu,turbo-quant.cuh,turbo-wht.cu/cuh):k_set_rows_turbo3: quantises incoming F32 KV tokens into turbo3 blocks on GPU__shfl_sync(qs) and__ballot_sync(signs) — no atomicsFlash attention integration (
fattn-common.cuh,fattn-vec.cuh,fattn.cu):vec_dot_fattn_vec_KQ_turbo3_0: optimised KQ dot product — elem0/elem1 always share the same turbo3 block, so qs/signs loaded once per pair instead of twicedequantize_V_turbo3_0: ne==4 fast path — single qs byte + single signs byte covers all 4 elements; unrolled float2/half2 outputBug fix — VEC kernel Q/K stride mismatch:
vec_dot_fattn_vec_KQ_turbo3_0originally stepped the outer K loop bynthreads(=8) but Q registers are loaded in blocks ofnthreads * cpy_ne(=32). Threadtat stepsaccessed K element16s+2tbut Q element64*(s/4)+8t+2*(s%4)— matching only whent=0, s%4=0. Fixed by matching the f16 kernel: step bynthreads*cpy_ne, add innerk_KQ_1loop. The MMA kernel (prefill) was unaffected.Quality gate / auto-enable (
llama.cpp,ggml-cuda.cu):ggml_contextoverflow fix for large KV cache allocationsBenchmark results (Qwen3.5 35B A3B Q4_K_M, RTX 5090, tg128)
Memory: 3.47× compression vs F16 (3-bit vs 16-bit KV cache)
NIAH results (single-needle, Kamradt/RULER methodology)
Depth positions 0–100% at each context length. Score = correct needle retrievals / 11.
q8_0 baseline: 11/11 at 4K–64K, 10/11 at 128K, 10/11 at 256K. turbo3 matches or exceeds q8_0 at every length. The single miss at 128K (depth 20%) matches the q8_0 miss — not a turbo3 regression.
1M context tested with YaRN 4× rope scaling (
--rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144). q8_0 cannot fit at 1M alongside the model on a 32 GB card; turbo3 KV (~4.5 GB) fits with ~4 GB to spare.Key design choices
grp_norm / recon_norm(not rawgrp_norm) in the half-precision norm field so dequant is a single multiply__launch_bounds__(128)on the quantisation kernel prevents spilling with the large shared memory footprint🤖 Generated with Claude Code