TQ3_0: norm correction + zero block handling + full Metal GPU support #1
Open
devYRPauli wants to merge 2 commits into
The tq3_0 quantizer stored the raw RMS of the input block in `block_tq3_0.d`. After Lloyd-Max quantization and inverse Walsh-Hadamard reconstruction, the decoded block norm no longer matched the original block norm. The norm mismatch corrupted key vector magnitudes and damaged query-key dot products, causing generation degeneration on K-path runs (K=tq3_0, V=f16) while V-path runs (K=f16, V=tq3_0) remained coherent.

Additionally, near-zero blocks were guarded by forcing rms=1.0, which made zero-energy blocks decode as structured nonzero values and inject noise into attention at every step where the block should have been zero.

Fix:

* Detect near-zero blocks (`sum_sq < 1e-20`) and emit a true zero block: `d = 0`, packed qs zeroed. On decode, scale=0 produces a zero output with no noise.
* For non-zero blocks: quantize and run the inverse WHT to measure the reconstruction norm, then store `orig_norm / recon_norm` as the scale factor in `d`. On decode, multiplying centroids by this factor restores the original block norm.

Empirical impact on M1 Pro 16 GB Qwen 4B:

* K=tq3_0, V=f16 path: word-loop degeneration → coherent generation.
* Combined with K5/V4 hybrid and the QJL fix, 16K-context needle retrieval: 0% → 100%.

Refs: https://github.com/devYRPauli/turboquant-m1pro-evaluation
The original TQ3_0 commit (1fb1fb3) only provides CPU paths. On Metal, ggml_metal_device_supports_op returns false for TQ3_0, so the KV cache silently falls back to CPU and any -ngl > 0 run errors out. This adds the Metal kernel and registers TQ3_0 as supported.

Changes:

* ggml-metal-device.m: register GGML_TYPE_TQ3_0 in the supports_op switches for GET_ROWS, MUL_MAT (q4_0/q5_0/q5_1/q8_0 family), and CPY destinations.
* ggml-metal.metal: implement TQ3_0 GPU support
  * tq3_0 centroids/boundaries/sign mask as constant arrays
  * tq3_0_rht_forward / tq3_0_rht_inverse Walsh-Hadamard helpers
  * dequantize_tq3_0 LUT-based decode with the host-side scale semantics (orig_norm / recon_norm) preserved
  * Wire kernel_flash_attn_ext templates for tq3_0 K and V tensors

Verified on M1 Pro 16 GB Qwen 4B: K=tq3_0 / V=tq3_0 runs end-to-end on the GPU; no fallback to CPU, no degeneration. Combined with the norm-correction commit and the QJL fix upstream, 16K-context needle retrieval reaches 100%.

Refs: https://github.com/devYRPauli/turboquant-m1pro-evaluation
devYRPauli added a commit to devYRPauli/turboquant-m1pro-evaluation that referenced this pull request on May 28, 2026
Add an Upstream Contributions section to README.md pointing at:

* TheTom/turboquant_plus#93 (QJL orthogonal projection + sqrt(d) scale)
* Aaryan-Kapoor/llama.cpp#1 (tq3_0 norm correction + Metal kernels)
* wxtry's 70e45b7e which independently fixed GGML context sizing upstream in llama-cpp-turboquant on 2026-03-29

Add inline upstream-status notes to FINDINGS.md under each corresponding finding.

Add CLAUDE.md and FINAL_AUDIT_PROMPT.md to .gitignore: both were internal prompts used to assemble this repo and are not findings.
Quick summary of what this changes and how I tested it. It fixes two TQ3_0 correctness issues (the dequant norm/scale factor was wrong, and all-zero blocks weren't being handled), and it adds Metal GPU support for the type. Testing was on an Apple M1 Pro (16 GB) with Qwen 4B:
Requirements:
What

Two related improvements on top of `1fb1fb3a` (Add TurboQuant TQ3_0 KV cache quantization):

* `fix(tq3_0): norm correction + zero block handling` (commit `e23fd44`)
* `feat(metal): add TQ3_0 KV cache support on Apple GPU` (commit `684517a`)

Why

Norm correction (correctness)

`quantize_row_tq3_0_ref` previously stored the raw block RMS in `block_tq3_0.d`. After Lloyd-Max quantization and the inverse Walsh-Hadamard transform, the decoded block norm no longer matches the original block norm. This norm mismatch corrupts key vector magnitudes, damages query-key dot products, and causes degeneration on K-path runs (K=tq3_0, V=f16) while V-path runs (K=f16, V=tq3_0) remain coherent. The near-zero guard `rms = 1.0f` also makes empty blocks decode as structured nonzero noise.

The fix:

* Detect near-zero blocks (`sum_sq < 1e-20f`) and emit a true zero block (`d = 0`, packed `qs` zeroed). On decode, scale=0 yields a zero output with no noise.
* For non-zero blocks, quantize and run the inverse WHT to measure the reconstruction norm, then store `orig_norm / recon_norm` as the scale. On decode, multiplying centroids by this factor restores the block norm.

Metal GPU support (feature)

On Metal, `ggml_metal_device_supports_op` returns false for `GGML_TYPE_TQ3_0` because the type isn't listed in the switches. The KV cache silently falls back to CPU and any `-ngl > 0` run errors out. This PR registers the type as supported and provides the Metal kernels:

* `ggml-metal-device.m`: add `GGML_TYPE_TQ3_0` to the `GET_ROWS`, `MUL_MAT` (`q4_0/q5_0/q5_1/q8_0` family), and `CPY` destination switches.
* `ggml-metal.metal`: add `tq3_0` centroid/boundary/sign mask constants, `tq3_0_rht_forward` / `tq3_0_rht_inverse` Walsh-Hadamard helpers, a `dequantize_tq3_0` LUT-based decoder (preserves the host-side `orig_norm / recon_norm` scale semantics), and wire `kernel_flash_attn_ext` templates for `tq3_0` K and V tensors.

Validation (M1 Pro 16 GB, Qwen 4B)

* `K=tq3_0, V=f16` path: word-loop degeneration → coherent generation after norm correction.
* `K=tq3_0 / V=tq3_0` with `-ngl > 0`: runs end-to-end on Metal, no CPU fallback, no degeneration.
* Combined with the QJL fix in `TheTom/turboquant_plus` (PR #93), 16K-context needle-in-haystack retrieval goes from 0% (stock) to 100%.

Reproducer and full write-up

https://github.com/devYRPauli/turboquant-m1pro-evaluation

Patch files for the two commits are also archived under `patches/` in that repo for reference.