
fix(qjl): use orthogonal projection and sqrt(d) scale factor #93

Merged
TheTom merged 2 commits into
TheTom:main from
devYRPauli:qjl/orthogonal-projection-and-scale-fix
May 28, 2026

Conversation

@devYRPauli

@devYRPauli devYRPauli commented May 28, 2026

Contributor

What

The QJL stage in turboquant/qjl.py used a Gaussian random matrix S ~ N(0,1)^(d×d) for sign(S · x) with dequantization scale sqrt(π/2) / d. Both that pairing and the corrected orthogonal-S / sqrt(π/2) / sqrt(d) pairing are unbiased on the inner product ⟨x̂, y⟩ — the QJL stage isn't "off." The actual mechanism is variance, not magnitude. On real LLM KV vectors with head_dim=128:

  1. The Gaussian projection introduces variance of order ||r||² / d per reconstructed dimension. Attention accumulation over thousands of tokens drives this into word-loop degeneration ("genetic genetic genetic...") the moment QJL is enabled.
  2. Both estimators produce the same expected reconstruction norm ≈ √(π/2)·||x||. The sqrt(π/2) / d scale is the unbiased form for Gaussian S; sqrt(π/2) / sqrt(d) is the unbiased form for orthogonal S. The fix isn't a magnitude correction — it's swapping the projection family and matching the scale to it so variance collapses.
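
A minimal sketch of the variance claim above, not the code in turboquant/qjl.py. It assumes the estimator has the form x̂ = scale · ||x|| · Sᵀ sign(S x), that keys and queries are random unit vectors, and that S is drawn once and reused. The helper names (`haar_orthogonal`, `inner_product_errors`, `compare_projections`) are hypothetical. It reports the mean and standard deviation of the inner-product error ⟨x̂, y⟩ − ⟨x, y⟩ for each projection/scale pairing.

```python
import numpy as np


def haar_orthogonal(d, rng):
    """Random orthogonal matrix via QR of a Gaussian matrix.

    Column signs are flipped so diag(R) > 0 (an assumed sign convention).
    """
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def inner_product_errors(S, scale, X, Y):
    """Return <x_hat, y> - <x, y> for each row pair, using 1-bit sign sketches."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    signs = np.sign(X @ S.T)             # sign(S x) for every row x
    X_hat = scale * norms * (signs @ S)  # scale * ||x|| * S^T sign(S x)
    return np.sum(X_hat * Y, axis=1) - np.sum(X * Y, axis=1)


def compare_projections(d=128, n=4000, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit "keys"
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # unit "queries"

    S_gauss = rng.standard_normal((d, d))
    S_orth = haar_orthogonal(d, rng)

    # Each projection family is paired with the scale named in the PR text.
    err_g = inner_product_errors(S_gauss, np.sqrt(np.pi / 2) / d, X, Y)
    err_o = inner_product_errors(S_orth, np.sqrt(np.pi / 2) / np.sqrt(d), X, Y)
    return {
        "gaussian": (float(err_g.mean()), float(err_g.std())),
        "orthogonal": (float(err_o.mean()), float(err_o.std())),
    }


if __name__ == "__main__":
    for name, (mean, std) in compare_projections().items():
        print(f"{name:10s} mean error {mean:+.4f}  std {std:.4f}")
```

Under these assumptions both mean errors should sit near zero while the spread differs, which is the "variance, not magnitude" point. Synthetic unit vectors are only a stand-in for real KV vectors.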

Fix

  • Generate S as a sign-corrected random orthogonal matrix via QR. Orthogonal projections preserve norms and inner products exactly; reconstruction variance collapses sharply.
  • Use the matching unbiased scale sqrt(π/2) / sqrt(d) so the orthogonal-S estimator stays unbiased on ⟨x̂, y⟩.
  • Add a shrinkage parameter (default 1.0 = classical paper-faithful unbiased QJL). MMSE-optimal value is 2/np.pi ≈ 0.6366 — derivation in the docstring (from E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||²).
  • Enforce the orthogonality contract in __init__: ||S Sᵀ − I||_F < 1e-10. New TestQJLProjection.test_projection_matrix_is_orthogonal at d ∈ {64, 128, 256, 512} with the tighter < 1e-12 tolerance.
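
The four bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: the class name `QJLSketch`, the method signatures, the choice to store ||x|| alongside the sign bits, and the diag(R) sign convention are all hypothetical. Only the `shrinkage` kwarg, its default of 1.0, the sqrt(π/2)/sqrt(d) scale, and the 1e-10 tolerance come from the description.

```python
import numpy as np


class QJLSketch:
    """Illustrative 1-bit QJL stage with an orthogonal projection (hypothetical API)."""

    def __init__(self, d, seed=0, tol=1e-10):
        rng = np.random.default_rng(seed)
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        # Assumed sign correction: flip columns so that diag(R) is positive.
        self.S = q * np.sign(np.diag(r))
        self.d = d
        # Orthogonality contract: the sqrt(d) scale below relies on S S^T = I.
        err = np.linalg.norm(self.S @ self.S.T - np.eye(d), "fro")
        if err >= tol:
            raise ValueError(f"projection is not orthogonal: ||S S^T - I||_F = {err:.3e}")

    def quantize(self, x):
        """Keep one sign bit per projected coordinate plus the vector norm."""
        return np.sign(self.S @ x), float(np.linalg.norm(x))

    def dequantize(self, signs, norm, shrinkage=1.0):
        """shrinkage=1.0 is the classical estimator; 2/np.pi minimizes MSE."""
        scale = np.sqrt(np.pi / 2) / np.sqrt(self.d)
        return shrinkage * scale * norm * (self.S.T @ signs)


qjl = QJLSketch(d=128)
x = np.random.default_rng(1).standard_normal(128)
signs, norm = qjl.quantize(x)
x_hat = qjl.dequantize(signs, norm)                         # default, classical
x_mmse = qjl.dequantize(signs, norm, shrinkage=2 / np.pi)   # lower MSE, biased
```

Because Sᵀ preserves norms, ||Sᵀ sign(S x)|| equals sqrt(d) in this sketch, so the reconstruction norm ratio at shrinkage=1.0 is sqrt(π/2) for every input rather than only on average.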

Empirical impact (M1 Pro 16 GB, Qwen 4B, K5/V4 hybrid)

  • Needle-in-haystack retrieval at 16K tokens: 0% → 100% with these QJL fixes combined with K5/V4 Hybrid KV cache.
  • Verified orthogonality of S to machine precision: ||S Sᵀ − I||_F ≈ 5e-15 to 4e-14 across d ∈ {64, 128, 256, 512}.
  • At default shrinkage=1.0, reconstruction norm ratio averages 1.253 = √(π/2) at every tested d, exactly matching the closed-form E[||x̂||²] = (π/2)·||x||².

Reproducer

Full benchmarks, logs and the post-mortem write-up: https://github.com/devYRPauli/turboquant-m1pro-evaluation

Note: the parallel norm_correction work landed in PolarQuant on 86bcbbe — this PR is independent of that and lands on the same main.

The QJL stage used a Gaussian random matrix S ~ N(0,1)^(d×d) for
sign(S · x). Gaussian projections introduce variance of order ||r||^2/d
per dimension; for head_dim=128, this is large relative to signal, and
attention accumulation drives generation into word-loop degeneration
once QJL is enabled on real LLM KV vectors.

Fix:
* Replace Gaussian S with a random orthogonal matrix via QR
  decomposition (sign-corrected to give a proper rotation). Orthogonal
  matrices preserve norms and inner products and reduce reconstruction
  variance sharply.
* Correct the dequantization scale from sqrt(pi/2)/d to sqrt(pi/2)/sqrt(d).
  For d=128, the previous formula was ~11x too small, which effectively
  disabled the QJL correction term even when the stage was enabled.
* Add a damping parameter (default 0.7) to dequantize(), validated as
  the stable operating point in M1 Pro 16K-context needle benchmarks
  on Qwen 4B with K5/V4 hybrid quantization.

Empirical impact on M1 Pro (16 GB, Qwen 4B):
* Needle retrieval at 16K tokens: 0% → 100% with these QJL fixes
  combined with K5/V4 Hybrid configuration.
* Verified orthogonality: ||S S^T - I||_F < 1e-14.
* Existing tests pass (e.g., test_dequantized_has_correct_scale
  reports avg norm ratio 0.877 at d=128, well within [0.5, 2.0]).

Refs: turboquant-m1pro-evaluation reproducer and post-mortem at
https://github.com/devYRPauli/turboquant-m1pro-evaluation
@junleen

junleen commented May 28, 2026 via email


devYRPauli added a commit to devYRPauli/turboquant-m1pro-evaluation that referenced this pull request May 28, 2026
Add an Upstream Contributions section to README.md pointing at:
* TheTom/turboquant_plus#93 (QJL orthogonal projection + sqrt(d) scale)
* Aaryan-Kapoor/llama.cpp#1 (tq3_0 norm correction + Metal kernels)
* wxtry's 70e45b7e which independently fixed GGML context sizing
  upstream in llama-cpp-turboquant on 2026-03-29

Add inline upstream-status notes to FINDINGS.md under each
corresponding finding.

Add CLAUDE.md and FINAL_AUDIT_PROMPT.md to .gitignore: both were
internal prompts used to assemble this repo and are not findings.
@TheTom

TheTom commented May 28, 2026

Owner

Thanks. Math fix is right, and unbiased reconstruction is a real win over what was there. Couple of asks before merge.

1. Damping default.

Closed form: for unbiased orthogonal QJL with the √(d) scale, E[||x̂||²] = (π/2)·||x||² exactly, so MMSE-optimal shrinkage is α* = 2/π ≈ 0.6366. Empirical sweep at d ∈ {64,128,256,512} matches to grid resolution. damping=0.7 is ~2% off optimum.

Could you:

  • Default the new kwarg to 1.0 (unbiased, paper-faithful, backward-compatible for TurboQuant.dequantize).
  • Rename damping to shrinkage. Document 2/np.pi as the MMSE-optimal value in the docstring with the one-line derivation.

Callers who don't pass an explicit kwarg should keep getting classical unbiased QJL.
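
For reference, the one-line derivation can be expanded as below. It takes the two moments quoted in this thread as given, E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||², and treats α as a scalar applied to x̂ after dequantization.

```latex
\begin{aligned}
\mathrm{MSE}(\alpha)
  &= \mathbb{E}\,\lVert \alpha\hat{x} - x \rVert^{2}
   = \alpha^{2}\,\mathbb{E}\lVert \hat{x} \rVert^{2}
     - 2\alpha\,\mathbb{E}\langle \hat{x}, x \rangle
     + \lVert x \rVert^{2} \\
  &= \left( \tfrac{\pi}{2}\,\alpha^{2} - 2\alpha + 1 \right) \lVert x \rVert^{2}, \\[4pt]
\frac{d\,\mathrm{MSE}}{d\alpha} = 0
  \;&\Longrightarrow\;
  \alpha^{*}
   = \frac{\mathbb{E}\langle \hat{x}, x \rangle}{\mathbb{E}\lVert \hat{x} \rVert^{2}}
   = \frac{2}{\pi} \approx 0.6366, \\[4pt]
\mathrm{MSE}(\alpha^{*}) &= \left( 1 - \tfrac{2}{\pi} \right) \lVert x \rVert^{2}
   \approx 0.3634\,\lVert x \rVert^{2}, \qquad
\mathrm{MSE}(1) = \left( \tfrac{\pi}{2} - 1 \right) \lVert x \rVert^{2}
   \approx 0.5708\,\lVert x \rVert^{2}.
\end{aligned}
```

Plugging in α = 0.7 gives about 0.3697·||x||², roughly 2% above the minimum, which is consistent with the "~2% off optimum" figure.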

2. Framing nit on the PR body.

Original code wasn't ~11× too small / disabled. It was ~1.3× too large in norm and high-variance. Both estimators are unbiased on ⟨x̂, y⟩; the gap is variance, not magnitude. Worth a one-line correction in the description. The new framing is the actual mechanism, and it matches the "QJL eliminates bias but explodes variance" finding from turbo4-resurrection.md, now quantified.

3. Tighten tests in this PR.

Existing test_qjl.py passed the destructive code. Please add:

  • assert np.linalg.norm(qjl.S @ qjl.S.T - np.eye(d), 'fro') < 1e-12 in the constructor or as a test (orthogonality contract).
  • Tighten test_dequantized_has_correct_scale from [0.5, 2.0] to [0.95, 1.05] for the unbiased classical recovery.

4. Heads-up, not a blocker.

I'll update the README §QJL to flag this stage as reference-only. Production TheTom/llama-cpp-turboquant drops QJL on both K and V (recommended config --cache-type-k q8_0 --cache-type-v turbo3), per the 5-group consensus in turbo4-resurrection.md. Your K5/V4 result is the first interesting K-side QJL data I've seen. Would be great to see it pushed to 64K+ context where the variance mechanism historically manifests (buun's CUDA turbo4 degraded -0.28% at 2K to +3.69% at 64K with unit damping + broken Gaussian S; your math fix + shrinkage might mitigate). Separate issue, not a blocker for this fix.

@devYRPauli

Contributor Author

Thanks. Closed-form derivation is cleaner; using that. Working through:

1. Rename + default 1.0 — yes. damping → shrinkage, default 1.0, docstring documents 2/np.pi as MMSE-optimal with the one-line derivation. TurboQuant.dequantize callers without the kwarg get classical unbiased QJL.

2. Framing fix on PR body — correct, mine was wrong. Both estimators are unbiased on ⟨x̂, y⟩; the gap is variance, not magnitude. Rewriting the description: original was high-variance, new estimator collapses variance, which is what lets the two-stage design deliver the paper's ~64% MSE reduction.

3. Tests — adding the orthogonality contract (||S Sᵀ − I||_F < 1e-12) in __init__ and as a separate test.

Quick math check on test_dequantized_has_correct_scale: under the new shrinkage=1.0, E[||x̂||/||x||] ≈ √(π/2) ≈ 1.253, so [0.95, 1.05] would fail. Two options that match your intent:

  • Tighten to [1.20, 1.30] (match classical-QJL norm ratio), or
  • Repurpose to the unbiased-contract on the inner product itself: |E[⟨x̂, y⟩] − ⟨x, y⟩| < 3·SE over many trials. This is what the paper actually proves; test_inner_product_unbiased_single_side already covers d=256 — could extend to d ∈ {64, 128, 256, 512} and replace the norm test, or keep both.

Which do you prefer?
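
The two options could be sketched as pytest-style checks like the ones below. This is a hedged illustration, not the contents of test_qjl.py: the helpers `make_S`, `reconstruct`, `norm_ratio`, and `inner_product_z_score` are hypothetical, the estimator form x̂ = sqrt(π/2)/sqrt(d) · ||x|| · Sᵀ sign(S x) is assumed, and the inner-product check averages over random (x, y) pairs for one fixed S per dimension, whereas the paper's statement is an expectation over S.

```python
import numpy as np

DIMS = (64, 128, 256, 512)


def make_S(d, rng):
    """Orthogonal projection via QR, with an assumed diag(R) > 0 sign fix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def reconstruct(S, X, shrinkage=1.0):
    """Row-wise x_hat = shrinkage * sqrt(pi/2)/sqrt(d) * ||x|| * S^T sign(S x)."""
    d = S.shape[0]
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    signs = np.sign(X @ S.T)
    return shrinkage * np.sqrt(np.pi / 2) / np.sqrt(d) * norms * (signs @ S)


def norm_ratio(d, n=256, seed=0):
    """Average ||x_hat|| / ||x|| over n random vectors."""
    rng = np.random.default_rng(seed)
    S = make_S(d, rng)
    X = rng.standard_normal((n, d))
    X_hat = reconstruct(S, X)
    ratios = np.linalg.norm(X_hat, axis=1) / np.linalg.norm(X, axis=1)
    return float(ratios.mean())


def inner_product_z_score(d, n=4000, seed=0):
    """|mean error| of <x_hat, y> - <x, y>, in units of its standard error."""
    rng = np.random.default_rng(seed)
    S = make_S(d, rng)
    X = rng.standard_normal((n, d))
    Y = rng.standard_normal((n, d))
    err = np.sum(reconstruct(S, X) * Y, axis=1) - np.sum(X * Y, axis=1)
    se = err.std(ddof=1) / np.sqrt(n)
    return float(abs(err.mean()) / se)


def test_norm_ratio_matches_classical_qjl():
    # Option 1: tighten the norm test around sqrt(pi/2) ~ 1.253.
    for d in DIMS:
        assert 1.20 <= norm_ratio(d) <= 1.30


def test_inner_product_unbiased():
    # Option 2: the mean inner-product error stays within 3 standard errors.
    for d in DIMS:
        assert inner_product_z_score(d) < 3.0
```

The norm check is effectively deterministic for an orthogonal S, while the 3·SE check is statistical, so a fixed seed keeps it reproducible.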

4. 64K+ follow-up — noted, separate issue. The variance-accumulation mechanism behind buun's −0.28% at 2K → +3.69% at 64K is exactly what the orthogonal projection should suppress; would be useful to confirm. Setting it up on remote hardware (M1 Pro 16 GB constrained me to 16K).

Pushing the rename + default 1.0 + docstring + orthogonality test now, and updating the PR description. Will tag once code is up so you can pick the test option.

Per @TheTom's review on PR TheTom#93:

* Rename `damping` kwarg to `shrinkage` on QJL.dequantize and
  TurboQuant.dequantize. Default 1.0 — classical paper-faithful
  unbiased estimator. Existing callers without the kwarg get
  classical unbiased QJL (backward-compatible).
* Docstrings document `2/np.pi ≈ 0.6366` as the MMSE-optimal
  shrinkage with the one-line derivation from E[||x̂||²] =
  (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||².
* Add the orthogonality contract:
  - `__init__` asserts `||S Sᵀ − I||_F < 1e-10` (cheap defensive
    check; required for the unbiased estimator).
  - New `TestQJLProjection.test_projection_matrix_is_orthogonal`
    at d ∈ {64, 128, 256, 512} with the tighter `< 1e-12`
    tolerance @TheTom suggested.

Verified: orthogonality error 5e-15 to 4e-14 across all tested d.
Reconstruction norm ratio under default shrinkage=1.0 averages
1.253 = √(π/2) at every d, exactly matching the closed form.
@devYRPauli

Contributor Author

Code pushed in fd18dbe (rename + default 1.0 + docstring with 2/π derivation + orthogonality contract ||S Sᵀ − I||_F < 1e-12 at d ∈ {64, 128, 256, 512}). PR description rewritten with the variance framing.

test_dequantized_has_correct_scale is the one open item — at default shrinkage=1.0 it averages 1.253 = √(π/2) exactly, so let me know your call between [1.20, 1.30] (norm-ratio tightening) and pivoting to the inner-product unbiasedness invariant. Either is a one-line push.

@TheTom TheTom merged commit 0cb20bc into TheTom:main May 28, 2026
TheTom added a commit that referenced this pull request May 28, 2026
…nt for prod

Expand the §QJL note to state explicitly that production drops the QJL
stage on both K and V. Name TheTom/llama-cpp-turboquant as the production
path and document the recommended config
(--cache-type-k q8_0 --cache-type-v turbo3). Keep the 5-group consensus
citation and add guidance for downstream users: use TurboQuantMSE for V
or straight PolarQuant for K. Only enable QJL classes for paper
reproducibility or K-side research below 8-bit, and validate at target
context length since QJL noise historically accumulates past ~16K.

Follow-up to #93 (QJL math fix). The reference impl is now correct, but
the production guidance hadn't been said this plainly anywhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
brosequist pushed a commit to brosequist/turboquant_plus that referenced this pull request May 28, 2026
…JL fix

Pulls in 3 upstream commits since merge-base 1224fef:
- c46f6b9 docs(papers): block-selector sparse attention WIP log
- 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93)
- 280b466 README: mark QJL as reference-only

Clean auto-merge. Only file touched by both sides was turboquant.py;
upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots
in alongside our V-norm/MSE accounting fix without conflict.

Our fork-local commits retained: V-norm in memory_stats, SeedSequence
PRNG, MSE compressed_size_bits, QJL regression test, rotation tests,
ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc.

PR TheTom#91 (ship/pr-90-curated) — TheTom's curated cherry-pick of 5 of
these — remains open; once it merges to upstream/main we'll want to
rebase/reset to drop redundant commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>