Merge turboquant#23962
…l-org#30
Perplexity benchmarking reveals catastrophic quality failure:
- f16: 6.121, q8_0: 6.111, q4_0: 6.142
- turbo3: 165.6 (27× worse)
Speed benchmarks were meaningless: fast garbage. Root cause investigation needed before any quality claims.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. V cache returns rotated-space values (cosine=0.02 vs correct 0.987)
2. dynamic_cast to llama_kv_cache_context fails for MoE models (uses llama_memory_hybrid_context, not kv_cache_context) → Q rotation and V inverse rotation NEVER executed
Fix: store rotation tensors in llm_graph_context, not KV cache. Or access through hybrid memory interface.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gml-org#31
Block 128: PPL=165.6 (same as block 32)
Disabled Q rotation: PPL=165.6 (same)
Root cause: dynamic_cast fails for MoE hybrid memory context. Q rotation and V inverse rotation never execute.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ml-org#31 ggml-org#30
ROOT CAUSE: pre-rotate-queries never executed because:
1. Q ne[0]=256 (GQA concatenated heads), rotation matrix ne[0]=128
2. mctx dynamic_cast failed for MoE hybrid memory
FIX: put inverse WHT rotation back in dequantize_full_block. This is slower (10.7 tok/s vs 77.7) but produces CORRECT results.
PERPLEXITY RESULTS:
- f16: 6.121
- q8_0: 6.111
- q4_0: 6.142
- turbo3: 6.194 (+1.4% vs q8_0) ✅
The speed optimization (pre-rotate-queries) needs to be reimplemented to work with GQA head layout and hybrid memory types.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Quality confirmed: PPL 6.194 (+1.4% vs q8_0)
Speed: 10.7 tok/s (inverse rotation in dequant, no pre-rotate-queries)
Previous speed claims (51-77 tok/s) were invalid: they measured the speed of garbage output.
Key lessons documented for future reference.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prefill speed: 739 → 1074 tok/s (0.40x q8_0, was 0.27x)
Quality: PPL 6.195 (unchanged from 6.194 baseline, +1.4% vs q8_0)
Metal shader changes:
- turbo3_dequantize_full_block: WHT butterfly now runs in fp16 (half). Centroids fit in fp16 (max |val| = 0.19), butterfly add/sub stays in range. 2x throughput on Apple Silicon Metal fp16 ALUs.
- dequantize_turbo3_0_t4: cooperative SIMD dequant for flash_attn_ext_vec. All 32 SIMD lanes work on the same block: each unpacks only its 4 elements, and the WHT butterfly runs across lanes via simd_shuffle. Eliminates 31/32 redundant full-block dequants.
Graph changes:
- Removed broken pre-rotate-queries code (WHT and RoPE don't commute: KV stores WHT(RoPE(K)) but graph rotation gave RoPE(WHT(Q)))
- Added TODO comments documenting the root cause and fix path
KV cache changes:
- Fixed rotation matrix storage comments (R vs R^T after ggml layout analysis)
- Fixed clear(true) zeroing rotation tensors without reinit (Codex catch)
- Corrected ggml_backend_tensor_set to store R/R^T in correct orientation
Docs:
- quality-benchmarks.md: top-of-tree quality+speed table
- turbo-speed-investigation.md: fp16 WHT results, RoPE/WHT commutativity
- pre-rotate-queries-investigation.md: full debugging log (20+ builds)
- turbo-quality-gate.sh: pre-push perplexity check script
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
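The non-commutativity noted under "Graph changes" can be checked numerically. The following is a minimal Python sketch, not the Metal or ggml code: `wht`, `rope`, the 8-element vectors, and the adjacent-pair RoPE layout are all illustrative assumptions. It compares attention scores when both sides use WHT(RoPE(.)) against the broken case where the query side uses RoPE(WHT(.)).

```python
import math

def wht(x):
    """Orthonormal Walsh-Hadamard transform (length must be a power of two)."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(y))
    return [v * scale for v in y]

def rope(x, pos, base=10000.0):
    """Toy RoPE: rotate each adjacent pair (2i, 2i+1) by pos * base**(-2i/d)."""
    d = len(x)
    out = list(x)
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = a * c - b * s, a * s + b * c
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# Made-up query/key vectors and positions, for illustration only.
q = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5, -2.0, 1.0]
k = [1.0, 0.5, -0.5, 2.0, 0.75, -1.25, 0.5, -1.0]
q_pos, k_pos = 3, 7

reference = dot(rope(q, q_pos), rope(k, k_pos))             # unrotated attention score
matched = dot(wht(rope(q, q_pos)), wht(rope(k, k_pos)))     # both sides WHT(RoPE(.))
mismatched = dot(rope(wht(q), q_pos), wht(rope(k, k_pos)))  # query side RoPE(WHT(.))
print(reference, matched, mismatched)
```

Because the transform is orthonormal, `matched` agrees with `reference` up to rounding, while `mismatched` is generally a different number.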
These docs belong in our project, not in a fork of someone else's repo. Moved to https://github.com/TheTom/turboquant_plus/tree/main/docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
Prefill: 1411 tok/s (0.52x q8_0, was 0.40x)
PPL: 6.195 (unchanged, within 0.001 of baseline)
Metal shader: turbo3_dequantize_full_block
- WHT butterfly now uses 32 x half4 vectors instead of 128 x half scalars
  Stage h=1,2: intra-vector swizzle (half4 constructor reorder)
  Stage h=4..64: inter-vector butterfly with computed stride
- Centroid lookup processes natural byte boundaries (4 elements per qs byte)
- Sign application and norm scaling use vectorized half4/float4
Codex review: no correctness bugs. Butterfly pairing, centroid unpacking, and sign application all verified correct.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Pre-computed turbo_wht_signs1_h4[32] and turbo_wht_signs2_h4[32] as constant half4 arrays. Eliminates per-element float→half conversion and reduces constant memory reads from 4 per half4 to 1. Marginal improvement (~1%) — Metal compiler already optimized the constant reads. But cleaner code and consistent with the half4 WHT. PPL: 6.195 (unchanged) Codex: no issues (included in Exp1 review scope) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
THE BIG WIN: moved WHT rotation from per-block dequant to graph-level ggml_mul_mat ops. 47% speedup over previous best.
Prefill: 2095 tok/s (0.78x q8_0, was 1424 = 0.53x)
PPL: 6.201 (within 0.01 of 6.195 baseline)
Compression: 4.9x (unchanged)
Key insight: applying WHT in build_attn (after RoPE, before build_attn_mha) matches the K quantize pipeline exactly. K stores WHT(RoPE(K)) from SET_ROWS, Q becomes WHT(RoPE(Q)) from graph mul_mat. Dot products preserved.
Changes:
- llama-graph.cpp: Q forward rotation (R @ q) and V un-rotation (R^T @ cur) in the llm_graph_input_attn_kv build_attn overload
- ggml-metal.metal: stripped WHT from turbo3_dequantize_full_block (returns centroid * norm in rotated space, graph handles un-rotation)
Codex review: pipeline point correct, reshape dims correct, lifecycle OK. Noted: only covers one build_attn overload (sufficient for Qwen3MoE).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
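To illustrate why the graph-level rotation keeps attention scores intact, here is a small Python sketch using a dense orthogonal matrix in place of the ggml_mul_mat ops. The 16-element size, the random sign pattern, and the helper names are assumptions made for the example, not values from the patch.

```python
import math
import random

def hadamard(n):
    """Sylvester Hadamard matrix of size n (a power of two), scaled to be orthonormal."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

rng = random.Random(0)
d = 16  # toy head dimension; the patch rotates 128-element groups

# R = H * diag(signs): a signed Hadamard rotation (hypothetical sign pattern).
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]
R = [[h * s for h, s in zip(row, signs)] for row in hadamard(d)]
R_T = [list(col) for col in zip(*R)]

q = [rng.gauss(0.0, 1.0) for _ in range(d)]
k = [rng.gauss(0.0, 1.0) for _ in range(d)]
v = [rng.gauss(0.0, 1.0) for _ in range(d)]

plain_score = dot(q, k)                          # q . k without any rotation
rotated_score = dot(matvec(R, q), matvec(R, k))  # (R q) . (R k), same value
v_restored = matvec(R_T, matvec(R, v))           # R^T undoes R on the value path
print(plain_score, rotated_score)
```

Since R is orthogonal, rotating both Q and K leaves the score unchanged, and applying R^T to the attention output (a linear combination of rotated V rows) recovers the unrotated result.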
THE BREAKTHROUGH: block-32 with graph-side WHT rotation reaches q8_0 parity.
Prefill: 2747 tok/s (1.02x q8_0, was 0.78x with block-128)
PPL: 5.460 (32-chunk) / 6.193 (8-chunk), within noise of baseline
Compression: 4.6x (slightly less than 4.9x due to per-block norm overhead)
Changes:
- QK_TURBO3: 128 → 32 (matches q4_0 block size for GPU parallelism)
- dequantize_turbo3_0: simple centroid lookup + norm scale (no WHT, no full-block)
- dequantize_turbo3_0_t4: same simple path (no SIMD shuffle needed)
- Flash attention nl: 8→2 (non-vec), 32→8 (vec) matching new block size
Why this works: with graph-side WHT rotation, dequant no longer needs the 128-element WHT butterfly. Each 32-element block can be decoded independently. Smaller blocks = more GPU parallelism = faster flash attention.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Added TURBO_LAYER_ADAPTIVE env var for per-layer cache type selection:
  0 = uniform (default)
  1 = q8_0 for first+last 4 layers, turbo3 for middle 32
  2 = q8_0 for last 8 layers, turbo3 for first 32
Results (Qwen3.5-35B-A3B, 8 chunks):
  uniform turbo3: PPL = 6.193 (+1.3% vs q8_0)
  mode 1: PPL = 6.185 (+1.2% vs q8_0)
  mode 2: PPL = 6.110 (+0.0% vs q8_0!!!)
Mode 2 achieves q8_0 quality (PPL 6.110 vs 6.111) while compressing 32 of 40 layers at turbo3 (4.6x). Only the last 8 layers use q8_0. Effective compression: ~3.5x overall vs 2.0x uniform q8_0.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…ow guard
1. Thread-safe static init via C++ lambda (was data race on static int)
2. Guard n_layer >= 8 to prevent unsigned underflow on small models
3. Use const local for n_layer and is_turbo check
PPL verified: mode 2 still gives 6.1095 (matching q8_0 baseline)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
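The TURBO_LAYER_ADAPTIVE modes and the n_layer guard can be sketched in Python as follows. The function name `layer_cache_type` is hypothetical, and falling back to uniform turbo3 when n_layer < 8 is an assumption about what the guard does; the commit only says the guard exists.

```python
def layer_cache_type(il, n_layer, mode):
    """Pick a KV cache type for layer index `il` (0-based).

    mode 0: uniform turbo3
    mode 1: q8_0 for the first 4 and last 4 layers, turbo3 elsewhere
    mode 2: q8_0 for the last 8 layers, turbo3 elsewhere
    """
    # Assumed guard behaviour: small models fall back to uniform turbo3, so
    # n_layer - 8 is never evaluated where an unsigned value would wrap around.
    if mode == 0 or n_layer < 8:
        return "turbo3"
    if mode == 1:
        return "q8_0" if il < 4 or il >= n_layer - 4 else "turbo3"
    if mode == 2:
        return "q8_0" if il >= n_layer - 8 else "turbo3"
    return "turbo3"

# 40 layers, as in the Qwen3.5-35B-A3B results above.
mode2 = [layer_cache_type(il, 40, 2) for il in range(40)]
print(mode2.count("turbo3"), mode2.count("q8_0"))
```

For a 40-layer model, mode 2 assigns turbo3 to layers 0..31 and q8_0 to layers 32..39, matching the "32 of 40 layers" figure.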
…n data
Part of ggml-org#32: turbo3 prefill degrades relative to q8_0 with context length.
Changes so far:
- Skip ggml_cont when tensors already contiguous (+1%, minimal)
- Generated 32x32 rotation matrices (turbo-rotation-data-32.h) for reduced group size approach (16x less matmul compute)
- Fixed V un-rotation to check v->type not k->type
Next: update QK_TURBO3_GROUP, Metal WHT kernel, and KV cache for d=32.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Reducing WHT rotation group from 128 to 32 elements degrades quality. Python kurtosis test showed 3.06 (good) on random data, but real Qwen3.5 KV tensors need 128-element groups for proper Gaussianization. Group-32 also didn't help speed — actually slower at all context sizes. This approach is a dead end. Next: custom GGML_OP_TURBO_WHT for O(d log d) rotation without dense matmul. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
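For context, a kurtosis check of the kind mentioned above might look like the sketch below. This is not the author's script: the Laplace-distributed input, the 2000 groups, the sign pattern, and the helper names are assumptions. It also shows why such a test is weak evidence, since synthetic independent samples Gaussianize easily even at group size 32, which is exactly the gap to real KV tensors the commit describes.

```python
import math
import random

def wht(x):
    """Orthonormal Walsh-Hadamard transform (length must be a power of two)."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(y))
    return [v * scale for v in y]

def kurtosis(xs):
    """Non-excess kurtosis m4 / m2^2; a Gaussian gives 3."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((v - mean) ** 2 for v in xs) / n
    m4 = sum((v - mean) ** 4 for v in xs) / n
    return m4 / (m2 * m2)

rng = random.Random(0)
group = 32
signs = [rng.choice((-1.0, 1.0)) for _ in range(group)]  # hypothetical sign pattern

raw, rotated = [], []
for _ in range(2000):
    # Heavy-tailed synthetic input: Laplace samples (theoretical kurtosis 6).
    g = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(group)]
    raw.extend(g)
    rotated.extend(wht([s * v for s, v in zip(signs, g)]))

k_raw = kurtosis(raw)
k_rot = kurtosis(rotated)
print(k_raw, k_rot)
```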
Adds a new ggml operation for applying WHT rotation to 128-element groups. Replaces the previous dense ggml_mul_mat(128x128, ...) approach.
Implementation:
- ggml.h: new op enum + ggml_turbo_wht(tensor, direction) API
- ggml.c: constructor with direction param in op_params
- ggml-cpu/ops.cpp: CPU impl (fp32 butterfly, parallel over groups)
- ggml-metal.metal: Metal kernel (fp16 half4 vectorized butterfly)
- ggml-metal-device: pipeline getter, supports_op
- ggml-metal-ops: dispatch with threadgroup-per-group layout
- llama-graph.cpp: uses ggml_turbo_wht instead of mul_mat+reshape
Results:
- PPL: 6.211 (within tolerance of 6.19 baseline)
- Context scaling: same as dense matmul (~8% gap at 4k vs q8_0)
- The matmul was NOT the bottleneck; dequant per KV position is
The custom op is still valuable: eliminates rotation tensor storage, cleaner graph (no reshape/cont), and correct O(d log d) complexity. The context scaling regression comes from flash attention dequant cost, not the graph rotation.
Codex review: fixed missing OP_NAME table entry. Noted CPU fp32 vs Metal fp16 precision difference (acceptable, Metal is the target).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
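A rough Python model of what a butterfly-based op with a direction parameter computes is sketched below. It is not the ggml or Metal implementation: the function name `turbo_wht_group`, the use of two sign vectors around the transform, and the sign values themselves are assumptions for illustration.

```python
import math
import random

def turbo_wht_group(x, signs1, signs2, inverse=False):
    """Apply D2*H*D1 (forward) or D1*H*D2 (inverse) to one group.

    H is the orthonormal Walsh-Hadamard transform; D1 and D2 are diagonal
    sign matrices. H is symmetric and its own inverse, so swapping the two
    sign vectors undoes the forward rotation.
    """
    pre, post = (signs2, signs1) if inverse else (signs1, signs2)
    y = [v * s for v, s in zip(x, pre)]
    h = 1
    while h < len(y):  # log2(d) stages of d/2 butterflies each
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(y))
    return [v * scale * s for v, s in zip(y, post)]

rng = random.Random(1)
d = 128
signs1 = [rng.choice((-1.0, 1.0)) for _ in range(d)]  # hypothetical sign patterns
signs2 = [rng.choice((-1.0, 1.0)) for _ in range(d)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]

rotated = turbo_wht_group(x, signs1, signs2)
restored = turbo_wht_group(rotated, signs1, signs2, inverse=True)

butterfly_adds = d * int(math.log2(d))  # 128 * 7 = 896 add/sub operations
dense_macs = d * d                      # 16384 multiply-adds for a 128x128 matmul
print(butterfly_adds, dense_macs)
```

The operation counts show the O(d log d) versus O(d^2) difference per group, which is consistent with the commit's finding that the rotation itself was cheap relative to dequant.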
Unrolled dequant with batched byte reads. Each 4-element group reads qs and signs bytes ONCE instead of per-element. Codex-verified bit indexing.
Context scaling results:
  ctx=1024: 0.981x q8_0 (was 0.976x)
  ctx=2048: 0.989x q8_0 (was 0.960x)
  ctx=4096: 0.981x q8_0 (was 0.921x)
The ratio now stays FLAT at ~98% vs q8_0 across all context sizes. Previous 7.9% gap at 4k context reduced to 1.9%.
PPL: 6.211 (within tolerance)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Checks both:
1. PPL within 5% of q8_0 baseline (8-chunk wikitext-2)
2. Context scaling ratio > 0.95 at 4K context
Both must pass.
Run: bash scripts/turbo-quality-gate.sh
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
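The two gate conditions amount to the following logic, shown here as a Python sketch rather than the shell script itself. The function name and parameters are hypothetical, and treating "within 5%" as a symmetric bound is an assumption; the sample values are taken from the results reported in this PR.

```python
def quality_gate(ppl_turbo, ppl_q8, ratio_4k, max_ppl_dev=0.05, min_ratio=0.95):
    """Return True only if both the perplexity and context-scaling checks pass."""
    ppl_ok = abs(ppl_turbo / ppl_q8 - 1.0) <= max_ppl_dev  # PPL within 5% of q8_0
    scaling_ok = ratio_4k > min_ratio                      # turbo3/q8_0 speed ratio at 4K
    return ppl_ok and scaling_ok

print(quality_gate(6.211, 6.111, 0.981))  # current numbers
print(quality_gate(165.6, 6.111, 0.981))  # the original broken build
print(quality_gate(6.211, 6.111, 0.921))  # before the batched-read dequant
```

Only the first call passes: the second fails the perplexity bound and the third fails the 4K scaling bound.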
Half-precision centroid table in vec flash attention dequant. Reduces constant cache pressure at high access volumes.
Decode improvements (tok/s):
  Short: 75.3 → 77.2 (+2.5%)
  8K: 59.2 → 67.3 (+13.7%)
  48K (Mario PDF): 36.7 → 39.0 (+6.3%)
PPL: unchanged (6.211)
Prefill: no regression
Fixes ggml-org#33
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Half LUT for cache pressure + float4 * scalar norm (1 multiply vs 4). Verified on main: PPL 6.211, decode 78.4 short / 68.3 at 8K. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
llama-bench had a hardcoded ggml_type_from_name() that didn't include turbo types. Now turbo3 and turbo4 work with -ctk/-ctv flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
Replace single 8-entry constant half LUT with two 4-entry LUTs (one for positive, one for negative centroids). Each lookup now has only 4 possible constant addresses instead of 8, reducing divergent constant cache access that causes 10x decode slowdown on M1 hardware.
Codex review caught sign-mapping bug in initial magnitude+sign approach: the sorted centroid LUT has reversed magnitude order for negative values. Split LUT avoids this by keeping the original index mapping within each half.
PPL: 6.2109 (identical to main)
Decode M5: 74.0 tok/s (vs 77.4 main, a 4.4% regression on M5)
Target: significant improvement on M1 where constant cache is the bottleneck
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
Signs can mix per element within a thread's 4-element dequant — each element independently selects from positive or negative LUT. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai
Port of @spiritbuun's norm correction from CUDA to Metal SET_ROWS.
After quantizing all 128 elements in a group, compute the L2 norm of the centroid reconstruction vector and store:
  corrected_norm = original_norm / ||centroid_vector||
instead of raw original_norm. This corrects systematic norm shrinkage from codebook quantization.
Zero decode cost: dequant code is unchanged, it just reads a better stored norm value. Only adds 128 FMAs to the quantizer (not hot path).
Results (Qwen3.5-35B-A3B, wikitext-2):
  Before: PPL 6.2109 (8-chunk), 5.4714 (32-chunk); +1.6% vs q8_0 at 8 chunks
  After: PPL 6.1756 (8-chunk), 5.4451 (32-chunk); +1.1% vs q8_0 at 8 chunks
  q8_0: PPL 6.1109 (8-chunk), 5.4145 (32-chunk)
0.5% quality improvement at literally zero speed cost.
Original CUDA implementation: github.com/spiritbuun/llama-cpp-turboquant-cuda (commit 721880c)
Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
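The norm correction can be illustrated with a toy quantizer in Python. This sketch uses a made-up 4-entry codebook and an 8-element group, and it omits the WHT rotation and bit packing; none of the names or values come from the CUDA or Metal code. It also assumes a nonzero input vector.

```python
import math

# Placeholder 4-level codebook for unit-normalized 8-element groups (not the real centroids).
CENTROIDS = [-0.534, -0.160, 0.160, 0.534]

def quantize(x, corrected):
    """Return (centroid indices, stored norm) for one group of values."""
    norm = math.sqrt(sum(v * v for v in x))
    idx = [min(range(len(CENTROIDS)), key=lambda i: abs(v / norm - CENTROIDS[i]))
           for v in x]
    if corrected:
        # Divide by the L2 norm of the centroid reconstruction so that
        # dequantization lands back on the original norm.
        recon = math.sqrt(sum(CENTROIDS[i] ** 2 for i in idx))
        norm = norm / recon
    return idx, norm

def dequantize(idx, stored_norm):
    """Unchanged decode path: centroid lookup times the stored norm."""
    return [CENTROIDS[i] * stored_norm for i in idx]

def l2(xs):
    return math.sqrt(sum(v * v for v in xs))

x = [3.0, -1.0, 0.5, 2.0, -2.5, 0.25, -0.75, 1.5]
raw_norm = l2(dequantize(*quantize(x, corrected=False)))
fixed_norm = l2(dequantize(*quantize(x, corrected=True)))
print(l2(x), raw_norm, fixed_norm)
```

With the correction, the reconstructed vector has exactly the original L2 norm (up to rounding), while the decode function is identical in both cases.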
…text overflow
Two bugs that caused turbo3 to silently fail on pre-M5 Apple Silicon:
1. turbo3/turbo4 require flash attention for the dequant path, but llama-bench defaults to flash_attn=disabled. Auto-enable FA when turbo cache types are detected, with a warning log message. This fixes context creation failures on M2 Pro/Max and similar hardware.
2. KV cache ggml context was sized for exactly K/V tensors per layer, but turbo types add 2 rotation matrix tensors (turbo_rotation and turbo_rotation_inv) that weren't accounted for. Add +2 tensor overhead to prevent GGML_ASSERT(obj_new) failure.
Tested on M5 Max (Apple9/has_tensor=true) and M2 Pro (Apple8/has_tensor=false).
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ported @spiritbuun's register centroid×norm LUT from CUDA to Metal.
On CUDA: 96-97% of q8_0 decode (big win).
On Metal: 75.2 tok/s vs 77.4 main (SLOWER, register spill).
The cn[8] float array spills to device memory on Metal's smaller register file, making it slower than constant memory access. Reverted to proven constant half LUT + float norm broadcast.
This is a fundamental Metal vs CUDA architecture difference:
- CUDA: 255 registers per thread, cn[8] fits easily
- Metal: smaller register file, 8 floats cause spill
The split-LUT approach (2x4 half entries) was also tested earlier and showed similar regression (74.0 tok/s). Constant half[8] with float norm broadcast remains the fastest vec dequant on Apple Silicon.
Co-Authored-By: spiritbuun <271142774+spiritbuun@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
…ling (Issue ggml-org#29)
Three bugs from the block-size-32 refactor:
1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4. Split into separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels. turbo4 now correctly does 3-bit PolarQuant + QJL residual correction.
2. Integer division in n_groups = nk0 / blocks_per_group silently dropped tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling division with tail-group bounds checking in turbo3, and GGML_ASSERT in WHT dispatch to catch non-128-aligned tensors.
3. TURBO_D constant was semantically coupled to QK_TURBO4. Replaced with TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added static_assert that QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift.
Closes ggml-org#29
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
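The tail-block problem in bug 2 is plain integer arithmetic, sketched here in Python. The helper name `group_spans` is hypothetical, and the example assumes nk0 counts 32-element blocks per row with 4 blocks per 128-element group, which is how the commit text reads but is not confirmed by the source.

```python
def group_spans(nk0, blocks_per_group=4):
    """Return (first_block, n_blocks) for each rotation group, including a short tail."""
    n_groups = (nk0 + blocks_per_group - 1) // blocks_per_group  # ceiling division
    return [(g * blocks_per_group, min(blocks_per_group, nk0 - g * blocks_per_group))
            for g in range(n_groups)]

nk0 = 192 // 32            # dk=192 gives 6 blocks of 32 elements
floor_groups = nk0 // 4    # old behaviour: 1 group, the last 2 blocks are dropped
spans = group_spans(nk0)   # new behaviour: one full group plus a 2-block tail
print(floor_groups, spans)
```

The `min(...)` term is the bounds check: the tail group covers only the blocks that actually exist.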
…stack
turbo_init_rotation() allocated a 128x128 float array (64KB) on the stack to generate the random Gaussian matrix, then memcpy'd it to the static turbo_rotation[]. llama.cpp worker threads have reduced stack sizes, causing segfault on first turbo4 quantize call.
Fix: generate directly into the static turbo_rotation[] array, eliminating the intermediate stack allocation entirely. The Gram-Schmidt QR decomposition already runs in-place on turbo_rotation[].
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Hi @DFveloper, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.
@DFveloper just FYI, PRs like this (massive AI-authored PR without respecting the contribution guidelines) will get you banned in the repo.