fix(qjl): use orthogonal projection and sqrt(d) scale factor by devYRPauli · Pull Request #93 · TheTom/turboquant_plus

devYRPauli · 2026-05-28T14:55:49Z

What

The QJL stage in turboquant/qjl.py used a Gaussian random matrix S ~ N(0,1)^(d×d) for sign(S · x) with dequantization scale sqrt(π/2) / d. Both that pairing and the corrected orthogonal-S / sqrt(π/2) / sqrt(d) pairing are unbiased on the inner product ⟨x̂, y⟩ — the QJL stage isn't "off." The actual mechanism is variance, not magnitude. On real LLM KV vectors with head_dim=128:

The Gaussian projection introduces variance of order ||r||² / d per reconstructed dimension. Attention accumulation over thousands of tokens drives this into word-loop degeneration ("genetic genetic genetic...") the moment QJL is enabled.
Both estimators produce the same expected reconstruction norm ≈ √(π/2)·||x||. The sqrt(π/2) / d scale is the unbiased form for Gaussian S; sqrt(π/2) / sqrt(d) is the unbiased form for orthogonal S. The fix isn't a magnitude correction — it's swapping the projection family and matching the scale to it so variance collapses.
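A minimal sketch of the comparison described above, assuming the reconstruction has the form x̂ = scale · ||x|| · Sᵀ sign(S x). The helper names (`gaussian_s`, `orthogonal_s`, `reconstruct`, `inner_product_stats`) are illustrative and are not taken from turboquant/qjl.py. It draws many independent S matrices from each family, pairs each with its matching scale, and reports the mean and variance of ⟨x̂, y⟩ against the true ⟨x, y⟩.

```python
import numpy as np


def gaussian_s(d, rng):
    # i.i.d. N(0, 1) entries: the projection family the PR replaces.
    return rng.standard_normal((d, d))


def orthogonal_s(d, rng):
    # QR of a Gaussian matrix; multiplying column j of Q by sign(R[j, j])
    # removes the sign ambiguity of the factorization.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def reconstruct(s, x, scale):
    # One-bit sketch sign(S x), mapped back through S^T and rescaled by ||x||.
    return scale * np.linalg.norm(x) * (s.T @ np.sign(s @ x))


def inner_product_stats(make_s, scale, x, y, trials, rng):
    # Mean and sample variance of <x_hat, y> over independent draws of S.
    est = np.array([reconstruct(make_s(x.size, rng), x, scale) @ y
                    for _ in range(trials)])
    return est.mean(), est.var(ddof=1)


d, trials = 64, 1000
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Gaussian S is paired with sqrt(pi/2) / d, orthogonal S with sqrt(pi/2) / sqrt(d).
gauss_mean, gauss_var = inner_product_stats(
    gaussian_s, np.sqrt(np.pi / 2) / d, x, y, trials, rng)
orth_mean, orth_var = inner_product_stats(
    orthogonal_s, np.sqrt(np.pi / 2) / np.sqrt(d), x, y, trials, rng)

print(f"true <x, y>        : {x @ y:+.3f}")
print(f"Gaussian   mean/var: {gauss_mean:+.3f} / {gauss_var:.3f}")
print(f"orthogonal mean/var: {orth_mean:+.3f} / {orth_var:.3f}")
```

Both means should land near the true inner product, while the variance column is where the two families differ. Synthetic Gaussian x and y are used here only for convenience; real KV vectors may behave differently.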

Fix

Generate S as a sign-corrected random orthogonal matrix via QR. Orthogonal projections preserve norms and inner products exactly; reconstruction variance collapses sharply.
Use the matching unbiased scale sqrt(π/2) / sqrt(d) so the orthogonal-S estimator stays unbiased on ⟨x̂, y⟩.
Add a shrinkage parameter (default 1.0 = classical paper-faithful unbiased QJL). MMSE-optimal value is 2/np.pi ≈ 0.6366 — derivation in the docstring (from E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||²).
Enforce the orthogonality contract in __init__: ||S Sᵀ − I||_F < 1e-10. New TestQJLProjection.test_projection_matrix_is_orthogonal at d ∈ {64, 128, 256, 512} with the tighter < 1e-12 tolerance.
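The first and last items in this list can be sketched together as follows. `OrthogonalProjection` is a hypothetical stand-in, not the class in turboquant/qjl.py, and the sign correction shown (flipping columns of Q by the sign of the diagonal of R) is one common reading of "sign-corrected"; the actual implementation may differ.

```python
import numpy as np


class OrthogonalProjection:
    """Hypothetical stand-in for the QJL projection; not the turboquant class."""

    def __init__(self, d, seed=0, tol=1e-10):
        rng = np.random.default_rng(seed)
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        # Sign correction: flip column j of Q wherever R[j, j] < 0.
        self.S = q * np.sign(np.diag(r))
        # Orthogonality contract, checked once at construction time.
        err = np.linalg.norm(self.S @ self.S.T - np.eye(d), "fro")
        if err >= tol:
            raise ValueError(f"S is not orthogonal: ||S S^T - I||_F = {err:.3e}")
        self.orthogonality_error = err


for d in (64, 128, 256, 512):
    proj = OrthogonalProjection(d)
    print(d, f"{proj.orthogonality_error:.2e}")
```

The sketch raises `ValueError` rather than using a bare `assert`, because assertions are skipped when Python runs with `-O`; the PR itself describes the check as an assert in `__init__`.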

Empirical impact (M1 Pro 16 GB, Qwen 4B, K5/V4 hybrid)

Needle-in-haystack retrieval at 16K tokens: 0% → 100% with these QJL fixes combined with K5/V4 Hybrid KV cache.
Verified orthogonality of S to machine precision: ||S Sᵀ − I||_F ≈ 5e-15 to 4e-14 across d ∈ {64, 128, 256, 512}.
At default shrinkage=1.0, reconstruction norm ratio averages 1.253 = √(π/2) at every tested d, exactly matching the closed-form E[||x̂||²] = (π/2)·||x||².

Reproducer

Full benchmarks, logs and the post-mortem write-up: https://github.com/devYRPauli/turboquant-m1pro-evaluation

Note: the parallel norm_correction work landed in PolarQuant on 86bcbbe — this PR is independent of that and lands on the same main.

The QJL stage used a Gaussian random matrix S ~ N(0,1)^(d×d) for sign(S · x). Gaussian projections introduce variance of order ||r||^2/d per dimension; for head_dim=128, this is large relative to signal, and attention accumulation drives generation into word-loop degeneration once QJL is enabled on real LLM KV vectors.

Fix:

* Replace Gaussian S with a random orthogonal matrix via QR decomposition (sign-corrected to give a proper rotation). Orthogonal matrices preserve norms and inner products and reduce reconstruction variance sharply.
* Correct the dequantization scale from sqrt(pi/2)/d to sqrt(pi/2)/sqrt(d). For d=128, the previous formula was ~11x too small, which effectively disabled the QJL correction term even when the stage was enabled.
* Add a damping parameter (default 0.7) to dequantize(), validated as the stable operating point in M1 Pro 16K-context needle benchmarks on Qwen 4B with K5/V4 hybrid quantization.

Empirical impact on M1 Pro (16 GB, Qwen 4B):

* Needle retrieval at 16K tokens: 0% → 100% with these QJL fixes combined with K5/V4 Hybrid configuration.
* Verified orthogonality: ||S S^T - I||_F < 1e-14.
* Existing tests pass (e.g., test_dequantized_has_correct_scale reports avg norm ratio 0.877 at d=128, well within [0.5, 2.0]).

Refs: turboquant-m1pro-evaluation reproducer and post-mortem at https://github.com/devYRPauli/turboquant-m1pro-evaluation

junleen · 2026-05-28T14:56:50Z

Hello, this is an automatic reply from QQ Mail. I will read your email as soon as possible. Thank you!

Add an Upstream Contributions section to README.md pointing at:

* TheTom/turboquant_plus#93 (QJL orthogonal projection + sqrt(d) scale)
* Aaryan-Kapoor/llama.cpp#1 (tq3_0 norm correction + Metal kernels)
* wxtry's 70e45b7e which independently fixed GGML context sizing upstream in llama-cpp-turboquant on 2026-03-29

Add inline upstream-status notes to FINDINGS.md under each corresponding finding.

Add CLAUDE.md and FINAL_AUDIT_PROMPT.md to .gitignore: both were internal prompts used to assemble this repo and are not findings.

TheTom · 2026-05-28T15:34:44Z

Thanks. Math fix is right, and unbiased reconstruction is a real win over what was there. Couple of asks before merge.

1. Damping default.

Closed form: for unbiased orthogonal QJL with the √(d) scale, E[||x̂||²] = (π/2)·||x||² exactly, so MMSE-optimal shrinkage is α* = 2/π ≈ 0.6366. Empirical sweep at d ∈ {64,128,256,512} matches to grid resolution. damping=0.7 is ~2% off optimum.

Could you:

Default the new kwarg to 1.0 (unbiased, paper-faithful, backward-compatible for TurboQuant.dequantize).
Rename damping to shrinkage. Document 2/np.pi as the MMSE-optimal value in the docstring with the one-line derivation.

Callers who don't pass an explicit kwarg should keep getting classical unbiased QJL.
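The closed form in item 1 can be checked with a few lines of arithmetic. Expanding E||α·x̂ − x||² with the two moments quoted in this thread, E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||², gives a relative MSE of α²·π/2 − 2α + 1, which is minimized at α* = 2/π. The sketch below evaluates that quadratic at 1.0, 0.7 and 2/π. Reading "~2% off optimum" as relative excess MSE is an assumption; the review does not say which metric it means, and `relative_mse` is an illustrative helper.

```python
import numpy as np


def relative_mse(alpha):
    # E||alpha * x_hat - x||^2 / ||x||^2, using the two moments quoted above:
    # E||x_hat||^2 = (pi/2) ||x||^2 and E<x_hat, x> = ||x||^2.
    return alpha**2 * np.pi / 2 - 2 * alpha + 1


alpha_star = 2 / np.pi  # root of the derivative: alpha * pi - 2 = 0

for alpha in (1.0, 0.7, alpha_star):
    excess = relative_mse(alpha) / relative_mse(alpha_star) - 1
    print(f"alpha={alpha:.4f}  MSE/||x||^2={relative_mse(alpha):.4f}  "
          f"excess over optimum={excess:.1%}")
```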

2. Framing nit on the PR body.

Original code wasn't ~11× too small / disabled. It was ~1.3× too large in norm and high-variance. Both estimators are unbiased on ⟨x̂, y⟩; the gap is variance, not magnitude. Worth a one-line correction in the description. The new framing is the actual mechanism, and it matches the "QJL eliminates bias but explodes variance" finding from turbo4-resurrection.md, now quantified.

3. Tighten tests in this PR.

Existing test_qjl.py passed the destructive code. Please add:

assert np.linalg.norm(qjl.S @ qjl.S.T - np.eye(d), 'fro') < 1e-12 in the constructor or as a test (orthogonality contract).
Tighten test_dequantized_has_correct_scale from [0.5, 2.0] to [0.95, 1.05] for the unbiased classical recovery.

4. Heads-up, not a blocker.

I'll update the README §QJL to flag this stage as reference-only. Production TheTom/llama-cpp-turboquant drops QJL on both K and V (recommended config --cache-type-k q8_0 --cache-type-v turbo3), per the 5-group consensus in turbo4-resurrection.md. Your K5/V4 result is the first interesting K-side QJL data I've seen. Would be great to see it pushed to 64K+ context where the variance mechanism historically manifests (buun's CUDA turbo4 degraded -0.28% at 2K to +3.69% at 64K with unit damping + broken Gaussian S; your math fix + shrinkage might mitigate). Separate issue, not a blocker for this fix.

devYRPauli · 2026-05-28T15:46:45Z

Thanks. Closed-form derivation is cleaner; using that. Working through:

1. Rename + default 1.0 — yes. damping → shrinkage, default 1.0, docstring documents 2/np.pi as MMSE-optimal with the one-line derivation. TurboQuant.dequantize callers without the kwarg get classical unbiased QJL.

2. Framing fix on PR body — correct, mine was wrong. Both estimators are unbiased on ⟨x̂, y⟩; the gap is variance, not magnitude. Rewriting the description: original was high-variance, new estimator collapses variance, which is what lets the two-stage design deliver the paper's ~64% MSE reduction.

3. Tests — adding the orthogonality contract (||S Sᵀ − I||_F < 1e-12) in __init__ and as a separate test.

Quick math check on test_dequantized_has_correct_scale: under the new shrinkage=1.0, E[||x̂||/||x||] ≈ √(π/2) ≈ 1.253, so [0.95, 1.05] would fail. Two options that match your intent:

Tighten to [1.20, 1.30] (match classical-QJL norm ratio), or
Repurpose to the unbiased-contract on the inner product itself: |E[⟨x̂, y⟩] − ⟨x, y⟩| < 3·SE over many trials. This is what the paper actually proves; test_inner_product_unbiased_single_side already covers d=256 — could extend to d ∈ {64, 128, 256, 512} and replace the norm test, or keep both.

Which do you prefer?
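The norm-ratio argument can be reproduced with a short sketch, assuming the reconstruction x̂ = shrinkage · sqrt(π/2)/sqrt(d) · ||x|| · Sᵀ sign(S x). Because an orthogonal Sᵀ preserves norms and the sign vector has norm sqrt(d), the ratio ||x̂||/||x|| is sqrt(π/2) for every x at shrinkage=1.0, not only on average. `orthogonal_s` and `norm_ratio` are illustrative helpers, not functions from test_qjl.py.

```python
import numpy as np


def orthogonal_s(d, rng):
    # Random orthogonal matrix via QR with column sign correction.
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def norm_ratio(d, rng, shrinkage=1.0):
    # ||x_hat|| / ||x|| for one random x and one random orthogonal S.
    s = orthogonal_s(d, rng)
    x = rng.standard_normal(d)
    scale = shrinkage * np.sqrt(np.pi / 2) / np.sqrt(d)
    x_hat = scale * np.linalg.norm(x) * (s.T @ np.sign(s @ x))
    return np.linalg.norm(x_hat) / np.linalg.norm(x)


rng = np.random.default_rng(0)
ratios = {d: norm_ratio(d, rng) for d in (64, 128, 256, 512)}
for d, ratio in ratios.items():
    print(d, f"{ratio:.4f}",
          "in [0.95, 1.05]:", 0.95 <= ratio <= 1.05,
          "in [1.20, 1.30]:", 1.20 <= ratio <= 1.30)
```

Under these assumptions a [0.95, 1.05] bound on the norm ratio cannot pass at shrinkage=1.0, while [1.20, 1.30] does; the inner-product option tests a different property and is not covered by this sketch.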

4. 64K+ follow-up — noted, separate issue. The variance-accumulation mechanism behind buun's −0.28% at 2K → +3.69% at 64K is exactly what the orthogonal projection should suppress; would be useful to confirm. Setting it up on remote hardware (M1 Pro 16 GB constrained me to 16K).

Pushing the rename + default 1.0 + docstring + orthogonality test now, and updating the PR description. Will tag once code is up so you can pick the test option.

@TheTom

Per @TheTom's review on PR TheTom#93:

* Rename `damping` kwarg to `shrinkage` on QJL.dequantize and TurboQuant.dequantize. Default 1.0: classical paper-faithful unbiased estimator. Existing callers without the kwarg get classical unbiased QJL (backward-compatible).
* Docstrings document `2/np.pi ≈ 0.6366` as the MMSE-optimal shrinkage with the one-line derivation from E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||².
* Add the orthogonality contract:
  - `__init__` asserts `||S Sᵀ − I||_F < 1e-10` (cheap defensive check; required for the unbiased estimator).
  - New `TestQJLProjection.test_projection_matrix_is_orthogonal` at d ∈ {64, 128, 256, 512} with the tighter `< 1e-12` tolerance @TheTom suggested.

Verified: orthogonality error 5e-15 to 4e-14 across all tested d. Reconstruction norm ratio under default shrinkage=1.0 averages 1.253 = √(π/2) at every d, exactly matching the closed form.

devYRPauli · 2026-05-28T15:49:38Z

Code pushed in fd18dbe (rename + default 1.0 + docstring with 2/π derivation + orthogonality contract ||S Sᵀ − I||_F < 1e-12 at d ∈ {64, 128, 256, 512}). PR description rewritten with the variance framing.

test_dequantized_has_correct_scale is the one open item — at default shrinkage=1.0 it averages 1.253 = √(π/2) exactly, so let me know your call between [1.20, 1.30] (norm-ratio tightening) and pivoting to the inner-product unbiasedness invariant. Either is a one-line push.

…nt for prod

Expand the §QJL note to state explicitly that production drops the QJL stage on both K and V. Name TheTom/llama-cpp-turboquant as the production path and document the recommended config (--cache-type-k q8_0 --cache-type-v turbo3).

Keep the 5-group consensus citation and add guidance for downstream users: use TurboQuantMSE for V or straight PolarQuant for K. Only enable QJL classes for paper reproducibility or K-side research below 8-bit, and validate at target context length since QJL noise historically accumulates past ~16K.

Follow-up to #93 (QJL math fix). The reference impl is now correct, but the production guidance hadn't been said this plainly anywhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…JL fix

Pulls in 3 upstream commits since merge-base 1224fef:

- c46f6b9 docs(papers): block-selector sparse attention WIP log
- 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93)
- 280b466 README: mark QJL as reference-only

Clean auto-merge. Only file touched by both sides was turboquant.py; upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots in alongside our V-norm/MSE accounting fix without conflict.

Our fork-local commits retained: V-norm in memory_stats, SeedSequence PRNG, MSE compressed_size_bits, QJL regression test, rotation tests, ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc.

PR TheTom#91 (ship/pr-90-curated), TheTom's curated cherry-pick of 5 of these, remains open; once it merges to upstream/main we'll want to rebase/reset to drop redundant commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

devYRPauli mentioned this pull request May 28, 2026

PolarQuant KV cache compression (TurboQuant, ICLR 2026) ml-explore/mlx-lm#1060

Open

TheTom merged commit 0cb20bc into TheTom:main May 28, 2026

devYRPauli mentioned this pull request May 29, 2026

TQ3_0: norm correction + zero block handling + full Metal GPU support Aaryan-Kapoor/llama.cpp#1

Open

TheTom merged 2 commits into TheTom:main from devYRPauli:qjl/orthogonal-projection-and-scale-fix
