
TQ3_0: norm correction + zero block handling + full Metal GPU support #1

Open: devYRPauli wants to merge 2 commits into Aaryan-Kapoor:turboquant-tq3_0 from devYRPauli:tq3_0-norm-correction-and-metal

@devYRPauli

What

Two related improvements on top of 1fb1fb3a (Add TurboQuant TQ3_0 KV cache quantization):

  1. fix(tq3_0): norm correction + zero block handling (commit e23fd44)
  2. feat(metal): add TQ3_0 KV cache support on Apple GPU (commit 684517a)

Why

Norm correction (correctness)

quantize_row_tq3_0_ref previously stored the raw block RMS in block_tq3_0.d. After Lloyd-Max quantization and the inverse Walsh-Hadamard transform, the decoded block norm no longer matches the original block norm. This norm mismatch corrupts key vector magnitudes, damages query-key dot products, and causes degeneration on K-path runs (K=tq3_0, V=f16) while V-path runs (K=f16, V=tq3_0) remain coherent. The near-zero guard rms = 1.0f also makes empty blocks decode as structured nonzero noise.

The fix:

  • Detect near-zero blocks (sum_sq < 1e-20f) and emit a true zero block (d = 0, packed qs zeroed). On decode, scale=0 yields a zero output with no noise.
  • For non-zero blocks: quantize and run the inverse WHT to measure the reconstruction norm, then store orig_norm / recon_norm as the scale. On decode, multiplying centroids by this factor restores the block norm.
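
The two bullets above can be sketched in a few lines of Python. This is a rough illustration, not the PR's C code: the block size of 32, the orthonormal Walsh-Hadamard transform without sign flips, and the centroid table (the textbook 8-level Lloyd-Max values for a unit Gaussian) are all assumptions, and `quantize_block` / `dequantize_block` are hypothetical names.

```python
import math

# Assumption: textbook 8-level Lloyd-Max centroids for a unit Gaussian.
# The table used by the PR may differ.
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]


def wht(v):
    """Orthonormal Walsh-Hadamard transform (its own inverse); len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def quantize_block(x):
    """Return (d, indices); d is orig_norm / recon_norm, or 0 for a near-zero block."""
    sum_sq = sum(v * v for v in x)
    if sum_sq < 1e-20:
        return 0.0, [0] * len(x)  # true zero block: d = 0, indices zeroed
    rms = math.sqrt(sum_sq / len(x))
    rotated = wht([v / rms for v in x])
    idx = [min(range(len(CENTROIDS)), key=lambda k: abs(CENTROIDS[k] - r)) for r in rotated]
    recon = wht([CENTROIDS[k] for k in idx])  # inverse WHT of the chosen centroids
    return math.sqrt(sum_sq) / norm(recon), idx


def dequantize_block(d, idx):
    """Decode: centroid lookup, inverse WHT, then the stored scale."""
    return [d * v for v in wht([CENTROIDS[k] for k in idx])]
```

Under these assumptions, `norm(dequantize_block(*quantize_block(x)))` matches `norm(x)` up to floating-point rounding, and an all-zero input decodes to all zeros.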

Metal GPU support (feature)

On Metal, ggml_metal_device_supports_op returns false for GGML_TYPE_TQ3_0 because the type isn't listed in the switches. KV cache silently falls back to CPU and any -ngl > 0 run errors out. This PR registers the type as supported and provides the Metal kernels:

  • ggml-metal-device.m: add GGML_TYPE_TQ3_0 to the GET_ROWS, MUL_MAT (q4_0/q5_0/q5_1/q8_0 family), and CPY destination switches.
  • ggml-metal.metal: add tq3_0 centroid/boundary/sign mask constants, tq3_0_rht_forward / tq3_0_rht_inverse Walsh-Hadamard helpers, a dequantize_tq3_0 LUT-based decoder (preserves the host-side orig_norm / recon_norm scale semantics), and wires kernel_flash_attn_ext templates for tq3_0 K and V tensors.
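
As a rough picture of what a forward/inverse randomized Hadamard pair does, here is a small Python sketch. It is not the Metal code from this PR: the block size of 32, the way the sign mask is combined with the transform (signs first, then WHT), and the sign values themselves are assumptions, and `rht_forward` / `rht_inverse` are hypothetical stand-ins for the `tq3_0_rht_*` helpers.

```python
import math
import random

N = 32  # assumed block size

# Hypothetical sign mask; the PR ships its own constant table.
_rng = random.Random(0)
SIGNS = [_rng.choice((-1.0, 1.0)) for _ in range(N)]


def wht(v):
    """Orthonormal Walsh-Hadamard transform (its own inverse)."""
    out = list(v)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [x * scale for x in out]


def rht_forward(x):
    """Flip signs, then apply the WHT."""
    return wht([s * v for s, v in zip(SIGNS, x)])


def rht_inverse(y):
    """Undo the WHT, then undo the sign flips."""
    return [s * v for s, v in zip(SIGNS, wht(y))]
```

Because both steps are orthogonal, the pair round-trips a block and preserves its norm, which is what lets the host-side `orig_norm / recon_norm` scale carry over to the GPU decoder unchanged.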

Validation (M1 Pro 16 GB, Qwen 4B)

  • CPU K=tq3_0, V=f16 path: word-loop degeneration → coherent generation after norm correction.
  • GPU K=tq3_0 / V=tq3_0 with -ngl > 0: runs end-to-end on Metal, no CPU fallback, no degeneration.
  • Combined with K5/V4 hybrid and the QJL fix landing in TheTom/turboquant_plus (TheTom/turboquant_plus#93), 16K-context needle-in-haystack retrieval goes from 0% (stock) to 100%.

Reproducer and full write-up

https://github.com/devYRPauli/turboquant-m1pro-evaluation

Patch files for the two commits are also archived under patches/ in that repo for reference.

The tq3_0 quantizer stored the raw RMS of the input block in
`block_tq3_0.d`. After Lloyd-Max quantization and inverse
Walsh-Hadamard reconstruction, the decoded block norm no longer
matched the original block norm. The norm mismatch corrupted key
vector magnitudes and damaged query-key dot products, causing
generation degeneration on K-path runs (K=tq3_0, V=f16) while
V-path runs (K=f16, V=tq3_0) remained coherent.

Additionally, near-zero blocks were guarded by forcing rms=1.0,
which made zero-energy blocks decode as structured nonzero values
and inject noise into attention every step a block should be zero.
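
A small Python sketch of why the `rms=1.0` guard is harmful. It assumes a block size of 32, an orthonormal WHT without sign flips, the textbook 8-level Lloyd-Max centroids for a unit Gaussian, and a `1e-20` threshold for the old guard; none of these are taken from the actual code, and `decode_with_rms_guard` is a hypothetical name.

```python
import math

# Assumption: textbook 8-level Lloyd-Max centroids for a unit Gaussian.
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]


def wht(v):
    """Orthonormal Walsh-Hadamard transform (its own inverse)."""
    out = list(v)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [x * scale for x in out]


def decode_with_rms_guard(x):
    """Quantize and decode a block the old way: near-zero blocks fall back to rms = 1.0."""
    sum_sq = sum(v * v for v in x)
    rms = math.sqrt(sum_sq / len(x)) if sum_sq >= 1e-20 else 1.0
    rotated = wht([v / rms for v in x])
    # Every rotated value is 0, but no centroid is 0, so each one snaps
    # to a nonzero centroid.
    idx = [min(range(len(CENTROIDS)), key=lambda k: abs(CENTROIDS[k] - r)) for r in rotated]
    return [rms * v for v in wht([CENTROIDS[k] for k in idx])]
```

With no zero in the centroid table, `decode_with_rms_guard([0.0] * 32)` cannot return an all-zero block, which is the noise the zero-block path removes.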

Fix:
* Detect near-zero blocks (`sum_sq < 1e-20`) and emit a true zero
  block: `d = 0`, packed qs zeroed. On decode, scale=0 produces a
  zero output with no noise.
* For non-zero blocks: quantize and run the inverse WHT to measure
  the reconstruction norm, then store `orig_norm / recon_norm` as
  the scale factor in `d`. On decode, multiplying centroids by this
  factor restores the original block norm.
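
Why the stored ratio restores the norm, as a short derivation. It assumes the rotation H is an orthonormal Walsh-Hadamard transform; x is the original block of n values, c the vector of selected centroids, and x̂ the decoded block:

```math
\hat{x} = d\,H^{-1}c, \qquad
d = \frac{\lVert x \rVert_2}{\lVert H^{-1}c \rVert_2}
\;\Longrightarrow\;
\lVert \hat{x} \rVert_2 = d\,\lVert H^{-1}c \rVert_2 = \lVert x \rVert_2 .
```

With the previous choice d = rms(x) = ‖x‖₂/√n, the decoded norm is ‖x‖₂ · ‖c‖₂/√n, which equals ‖x‖₂ only when the selected centroids happen to satisfy ‖c‖₂ = √n.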

Empirical impact on M1 Pro 16 GB Qwen 4B:
* K=tq3_0, V=f16 path: word-loop degeneration → coherent generation.
* Combined with K5/V4 hybrid and the QJL fix, 16K-context needle
  retrieval: 0% → 100%.

Refs: https://github.com/devYRPauli/turboquant-m1pro-evaluation

The original TQ3_0 commit (1fb1fb3) only provides CPU paths. On
Metal, ggml_metal_device_supports_op returns false for TQ3_0, so
the KV cache silently falls back to CPU and any -ngl > 0 run errors
out. This adds the Metal kernel and registers TQ3_0 as supported.

Changes:
* ggml-metal-device.m: register GGML_TYPE_TQ3_0 in the supports_op
  switches for GET_ROWS, MUL_MAT (q4_0/q5_0/q5_1/q8_0 family), and
  CPY destinations.
* ggml-metal.metal: implement TQ3_0 GPU support
  * tq3_0 centroids/boundaries/sign mask as constant arrays
  * tq3_0_rht_forward / tq3_0_rht_inverse Walsh-Hadamard helpers
  * dequantize_tq3_0 LUT-based decode with the host-side scale
    semantics (orig_norm / recon_norm) preserved
  * Wire kernel_flash_attn_ext templates for tq3_0 K and V tensors
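
To make "LUT-based decode" concrete, here is a minimal Python sketch of unpacking 3-bit indices and looking them up in a centroid table. The little-endian bit layout, the 32-value block, and the centroid values (textbook 8-level Lloyd-Max for a unit Gaussian) are assumptions; the real `block_tq3_0` layout may pack bits differently. The inverse Walsh-Hadamard step that follows the lookup is omitted.

```python
# Assumption: textbook 8-level Lloyd-Max centroids for a unit Gaussian.
CENTROIDS = [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520]


def pack3(indices):
    """Pack 3-bit indices into bytes, lowest index in the lowest bits (hypothetical layout)."""
    bits = 0
    for i, k in enumerate(indices):
        bits |= (k & 7) << (3 * i)
    return bits.to_bytes((3 * len(indices) + 7) // 8, "little")


def unpack3(qs, n):
    """Recover n 3-bit indices from the packed bytes."""
    bits = int.from_bytes(qs, "little")
    return [(bits >> (3 * i)) & 7 for i in range(n)]


def lut_decode(d, qs, n):
    """Centroid lookup scaled by d; values are still in the rotated domain."""
    return [d * CENTROIDS[k] for k in unpack3(qs, n)]
```

In this layout 32 indices occupy 12 bytes, and a block with `d = 0` decodes to zeros regardless of the packed bits.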

Verified on M1 Pro 16 GB Qwen 4B: K=tq3_0 / V=tq3_0 runs end-to-end
on the GPU; no fallback to CPU, no degeneration. Combined with the
norm-correction commit and the QJL fix upstream, 16K-context needle
retrieval reaches 100%.

Refs: https://github.com/devYRPauli/turboquant-m1pro-evaluation

devYRPauli added a commit to devYRPauli/turboquant-m1pro-evaluation that referenced this pull request May 28, 2026
Add an Upstream Contributions section to README.md pointing at:
* TheTom/turboquant_plus#93 (QJL orthogonal projection + sqrt(d) scale)
* Aaryan-Kapoor/llama.cpp#1 (tq3_0 norm correction + Metal kernels)
* wxtry's 70e45b7e which independently fixed GGML context sizing
  upstream in llama-cpp-turboquant on 2026-03-29

Add inline upstream-status notes to FINDINGS.md under each
corresponding finding.

Add CLAUDE.md and FINAL_AUDIT_PROMPT.md to .gitignore: both were
internal prompts used to assemble this repo and are not findings.
@devYRPauli

Author

Quick summary of what this changes and how I tested it.

It fixes two TQ3_0 correctness issues (the dequant norm/scale factor was wrong, and all-zero blocks weren't being handled), and it adds Metal GPU support for the type.

Testing was on an Apple M1 Pro (16 GB) with Qwen 4B:

  • K=tq3_0, V=f16 on CPU: generation used to fall into a word loop, and after the norm fix it generates coherently.
  • K=tq3_0 / V=tq3_0 with -ngl > 0: runs end to end on Metal now, with no CPU fallback and no degeneration.
  • Needle-in-haystack passes at both 2K and 4K. Combined with the K5/V4 hybrid and the QJL fix I landed in TheTom/turboquant_plus#93 ("fix(qjl): use orthogonal projection and sqrt(d) scale factor"), 16K-context retrieval goes from 0% on stock to 100%.

Requirements:

  • I've read the contributing guidelines.
  • AI usage disclosure: yes. I used the Claude CLI to help write and edit these changes, but I've reviewed and tested everything myself and I'm responsible for the submitted code.
