TQ3_0: norm correction + zero block handling + full Metal GPU support by devYRPauli · Pull Request #1 · Aaryan-Kapoor/llama.cpp

devYRPauli · 2026-05-28T14:59:22Z

What

Two related improvements on top of 1fb1fb3a (Add TurboQuant TQ3_0 KV cache quantization):

fix(tq3_0): norm correction + zero block handling (commit e23fd44)
feat(metal): add TQ3_0 KV cache support on Apple GPU (commit 684517a)

Why

Norm correction (correctness)

quantize_row_tq3_0_ref previously stored the raw block RMS in block_tq3_0.d. After Lloyd-Max quantization and the inverse Walsh-Hadamard transform, the decoded block norm no longer matches the original block norm. This norm mismatch corrupts key vector magnitudes, damages query-key dot products, and causes degeneration on K-path runs (K=tq3_0, V=f16) while V-path runs (K=f16, V=tq3_0) remain coherent. The near-zero guard rms = 1.0f also makes empty blocks decode as structured nonzero noise.

The fix:

Detect near-zero blocks (sum_sq < 1e-20f) and emit a true zero block (d = 0, packed qs zeroed). On decode, scale=0 yields a zero output with no noise.
For non-zero blocks: quantize and run the inverse WHT to measure the reconstruction norm, then store orig_norm / recon_norm as the scale. On decode, multiplying centroids by this factor restores the block norm.

Metal GPU support (feature)

On Metal, ggml_metal_device_supports_op returns false for GGML_TYPE_TQ3_0 because the type isn't listed in the switches. KV cache silently falls back to CPU and any -ngl > 0 run errors out. This PR registers the type as supported and provides the Metal kernels:

ggml-metal-device.m: add GGML_TYPE_TQ3_0 to the GET_ROWS, MUL_MAT (q4_0/q5_0/q5_1/q8_0 family), and CPY destination switches.
ggml-metal.metal: add tq3_0 centroid/boundary/sign mask constants, tq3_0_rht_forward / tq3_0_rht_inverse Walsh-Hadamard helpers, a dequantize_tq3_0 LUT-based decoder (preserves the host-side orig_norm / recon_norm scale semantics), and wires kernel_flash_attn_ext templates for tq3_0 K and V tensors.
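
For readers unfamiliar with LUT-based decoding, here is a rough host-side sketch in C of what a 3-bit lookup decode might look like. It is not the Metal kernel from commit 684517a. The LSB-first packing of eight indices into three bytes, the names `pack3` and `decode3`, and the table values are hypothetical, since the PR text does not describe the actual `block_tq3_0` layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical centroid table; the real constants live in ggml-metal.metal. */
static const float TQ3_LUT_SKETCH[8] = {
    -2.1520f, -1.3439f, -0.7560f, -0.2451f,
     0.2451f,  0.7560f,  1.3439f,  2.1520f,
};

/* Pack eight 3-bit indices into three bytes, LSB first (assumed layout). */
static void pack3(const uint8_t idx[8], uint8_t out[3]) {
    uint32_t bits = 0;
    for (int i = 0; i < 8; i++) {
        assert(idx[i] < 8);
        bits |= (uint32_t) idx[i] << (3 * i);
    }
    out[0] = (uint8_t) (bits & 0xFFu);
    out[1] = (uint8_t) ((bits >> 8) & 0xFFu);
    out[2] = (uint8_t) ((bits >> 16) & 0xFFu);
}

/* Decode three packed bytes into eight scaled centroid values. */
static void decode3(const uint8_t in[3], float scale, float out[8]) {
    const uint32_t bits = (uint32_t) in[0]
                        | ((uint32_t) in[1] << 8)
                        | ((uint32_t) in[2] << 16);
    for (int i = 0; i < 8; i++) {
        out[i] = TQ3_LUT_SKETCH[(bits >> (3 * i)) & 7u] * scale;
    }
}
```

The decoded values are still in the rotated domain, so an inverse Walsh-Hadamard step would follow. Passing the stored `orig_norm / recon_norm` factor as `scale` keeps the host-side semantics, and a stored scale of 0 decodes to zeros.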

Validation (M1 Pro 16 GB, Qwen 4B)

CPU K=tq3_0, V=f16 path: word-loop degeneration → coherent generation after norm correction.
GPU K=tq3_0 / V=tq3_0 with -ngl > 0: runs end-to-end on Metal, no CPU fallback, no degeneration.
Combined with the K5/V4 hybrid and the QJL fix landing in TheTom/turboquant_plus#93, 16K-context needle-in-haystack retrieval goes from 0% (stock) to 100%.

Reproducer and full write-up

https://github.com/devYRPauli/turboquant-m1pro-evaluation

Patch files for the two commits are also archived under patches/ in that repo for reference.

The tq3_0 quantizer stored the raw RMS of the input block in `block_tq3_0.d`. After Lloyd-Max quantization and inverse Walsh-Hadamard reconstruction, the decoded block norm no longer matched the original block norm. The norm mismatch corrupted key vector magnitudes and damaged query-key dot products, causing generation degeneration on K-path runs (K=tq3_0, V=f16) while V-path runs (K=f16, V=tq3_0) remained coherent. Additionally, near-zero blocks were guarded by forcing rms=1.0, which made zero-energy blocks decode as structured nonzero values and inject noise into attention every step a block should be zero.

Fix:

* Detect near-zero blocks (`sum_sq < 1e-20`) and emit a true zero block: `d = 0`, packed qs zeroed. On decode, scale=0 produces a zero output with no noise.
* For non-zero blocks: quantize and run the inverse WHT to measure the reconstruction norm, then store `orig_norm / recon_norm` as the scale factor in `d`. On decode, multiplying centroids by this factor restores the original block norm.

Empirical impact on M1 Pro 16 GB Qwen 4B:

* K=tq3_0, V=f16 path: word-loop degeneration → coherent generation.
* Combined with K5/V4 hybrid and the QJL fix, 16K-context needle retrieval: 0% → 100%.

Refs: https://github.com/devYRPauli/turboquant-m1pro-evaluation

The original TQ3_0 commit (1fb1fb3) only provides CPU paths. On Metal, ggml_metal_device_supports_op returns false for TQ3_0, so the KV cache silently falls back to CPU and any -ngl > 0 run errors out. This adds the Metal kernel and registers TQ3_0 as supported.

Changes:

* ggml-metal-device.m: register GGML_TYPE_TQ3_0 in the supports_op switches for GET_ROWS, MUL_MAT (q4_0/q5_0/q5_1/q8_0 family), and CPY destinations.
* ggml-metal.metal: implement TQ3_0 GPU support
  * tq3_0 centroids/boundaries/sign mask as constant arrays
  * tq3_0_rht_forward / tq3_0_rht_inverse Walsh-Hadamard helpers
  * dequantize_tq3_0 LUT-based decode with the host-side scale semantics (orig_norm / recon_norm) preserved
  * Wire kernel_flash_attn_ext templates for tq3_0 K and V tensors

Verified on M1 Pro 16 GB Qwen 4B: K=tq3_0 / V=tq3_0 runs end-to-end on the GPU; no fallback to CPU, no degeneration. Combined with the norm-correction commit and the QJL fix upstream, 16K-context needle retrieval reaches 100%.

Refs: https://github.com/devYRPauli/turboquant-m1pro-evaluation


Add an Upstream Contributions section to README.md pointing at:

* TheTom/turboquant_plus#93 (QJL orthogonal projection + sqrt(d) scale)
* Aaryan-Kapoor/llama.cpp#1 (tq3_0 norm correction + Metal kernels)
* wxtry's 70e45b7e which independently fixed GGML context sizing upstream in llama-cpp-turboquant on 2026-03-29

Add inline upstream-status notes to FINDINGS.md under each corresponding finding.

Add CLAUDE.md and FINAL_AUDIT_PROMPT.md to .gitignore: both were internal prompts used to assemble this repo and are not findings.

devYRPauli · 2026-05-29T18:33:04Z

Quick summary of what this changes and how I tested it.

It fixes two TQ3_0 correctness issues (the dequant norm/scale factor was wrong, and all-zero blocks weren't being handled), and it adds Metal GPU support for the type.

Testing was on an Apple M1 Pro (16GB) with Qwen 4B:

K=tq3_0, V=f16 on CPU: generation used to fall into a word loop, and after the norm fix it generates coherently.
K=tq3_0 / V=tq3_0 with -ngl > 0: runs end to end on Metal now, with no CPU fallback and no degeneration.
Needle-in-haystack passes at both 2K and 4K. Combined with the K5/V4 hybrid and the QJL fix I landed in TheTom/turboquant_plus#93 (fix(qjl): use orthogonal projection and sqrt(d) scale factor), 16K-context retrieval goes from 0% on stock to 100%.

Requirements:

I've read the contributing guidelines.
AI usage disclosure: yes. I used the Claude CLI to help write and edit these changes, but I've reviewed and tested everything myself and I'm responsible for the submitted code.

devYRPauli added 2 commits May 28, 2026 10:58

devYRPauli mentioned this pull request May 28, 2026

PolarQuant KV cache compression (TurboQuant, ICLR 2026) ml-explore/mlx-lm#1060

Open

devYRPauli wants to merge 2 commits into Aaryan-Kapoor:turboquant-tq3_0 from devYRPauli:tq3_0-norm-correction-and-metal
