fix(qjl): use orthogonal projection and sqrt(d) scale factor #93
The QJL stage used a Gaussian random matrix S ~ N(0,1)^(d×d) for sign(S · x). Gaussian projections introduce variance of order ||r||^2/d per dimension; for head_dim=128, this is large relative to signal, and attention accumulation drives generation into word-loop degeneration once QJL is enabled on real LLM KV vectors.

Fix:
* Replace Gaussian S with a random orthogonal matrix via QR decomposition (sign-corrected to give a proper rotation). Orthogonal matrices preserve norms and inner products and reduce reconstruction variance sharply.
* Correct the dequantization scale from sqrt(pi/2)/d to sqrt(pi/2)/sqrt(d). For d=128, the previous formula was ~11x too small, which effectively disabled the QJL correction term even when the stage was enabled.
* Add a damping parameter (default 0.7) to dequantize(), validated as the stable operating point in M1 Pro 16K-context needle benchmarks on Qwen 4B with K5/V4 hybrid quantization.

Empirical impact on M1 Pro (16 GB, Qwen 4B):
* Needle retrieval at 16K tokens: 0% → 100% with these QJL fixes combined with K5/V4 Hybrid configuration.
* Verified orthogonality: ||S S^T - I||_F < 1e-14.
* Existing tests pass (e.g., test_dequantized_has_correct_scale reports avg norm ratio 0.877 at d=128, well within [0.5, 2.0]).

Refs: turboquant-m1pro-evaluation reproducer and post-mortem at https://github.com/devYRPauli/turboquant-m1pro-evaluation
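The QR construction and the sqrt(d) scale described in this commit message can be sketched as follows. This is a minimal illustration, not the code in the PR: the function names `random_orthogonal`, `qjl_quantize` and `qjl_dequantize` are hypothetical, and the sign correction used here (multiplying columns of Q by the signs of diag(R)) is one common convention that may differ from the one the PR uses.

```python
import numpy as np


def random_orthogonal(d, rng):
    """Draw a d x d random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    # Scaling column j of Q by sign(R[j, j]) removes the sign ambiguity of QR.
    return q * signs


def qjl_quantize(S, x):
    """Keep only the signs of the projection plus the vector norm."""
    return np.sign(S @ x), np.linalg.norm(x)


def qjl_dequantize(S, signs, norm, shrinkage=1.0):
    """Reconstruct with the sqrt(pi/2)/sqrt(d) scale for orthogonal S."""
    d = S.shape[0]
    scale = np.sqrt(np.pi / 2) / np.sqrt(d)
    return shrinkage * scale * norm * (S.T @ signs)


rng = np.random.default_rng(0)
d = 128
S = random_orthogonal(d, rng)
x = rng.standard_normal(d)

signs, norm = qjl_quantize(S, x)
x_hat = qjl_dequantize(S, signs, norm)

print("orthogonality error:", np.linalg.norm(S @ S.T - np.eye(d)))
print("norm ratio:", np.linalg.norm(x_hat) / np.linalg.norm(x))
print("<x_hat, x> / ||x||^2:", (x_hat @ x) / (x @ x))
```

Because S is orthogonal and the sign vector has norm sqrt(d), the norm ratio under `shrinkage=1.0` equals sqrt(pi/2) up to rounding, while the inner-product ratio fluctuates around 1.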
|
Hello, this is an automatic reply from QQ Mail. I will read your email as soon as possible. Thank you!
|
Add an Upstream Contributions section to README.md pointing at:
* TheTom/turboquant_plus#93 (QJL orthogonal projection + sqrt(d) scale)
* Aaryan-Kapoor/llama.cpp#1 (tq3_0 norm correction + Metal kernels)
* wxtry's 70e45b7e which independently fixed GGML context sizing upstream in llama-cpp-turboquant on 2026-03-29

Add inline upstream-status notes to FINDINGS.md under each corresponding finding.

Add CLAUDE.md and FINAL_AUDIT_PROMPT.md to .gitignore: both were internal prompts used to assemble this repo and are not findings.
|
Thanks. Math fix is right, and unbiased reconstruction is a real win over what was there. Couple of asks before merge.

1. Damping default. Closed form: for unbiased orthogonal QJL with the √(d) scale, E[||x̂||²] = (π/2)·||x||² exactly, so MMSE-optimal shrinkage is α* = 2/π ≈ 0.6366. Empirical sweep at d ∈ {64,128,256,512} matches to grid resolution. Could you:
Callers who don't pass an explicit kwarg should keep getting classical unbiased QJL.

2. Framing nit on the PR body. Original code wasn't ~11× too small / disabled. It was ~1.3× too large in norm and high-variance. Both estimators are unbiased on ⟨x̂, y⟩; the gap is variance, not magnitude. Worth a one-line correction in the description. The new framing is the actual mechanism, and it matches the "QJL eliminates bias but explodes variance" finding from turbo4-resurrection.md, now quantified.

3. Tighten tests in this PR. Existing

4. Heads-up, not a blocker. I'll update the README §QJL to flag this stage as reference-only. Production TheTom/llama-cpp-turboquant drops QJL on both K and V (recommended config
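For reference, the shrinkage value quoted in item 1 follows in one step from the two moments cited in this thread, E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||². The derivation below is a sketch under exactly those two assumptions; α denotes the shrinkage factor applied to x̂.

```latex
\begin{aligned}
\mathbb{E}\,\lVert \alpha\hat{x} - x \rVert^2
  &= \alpha^2\,\mathbb{E}\,\lVert \hat{x} \rVert^2
     - 2\alpha\,\mathbb{E}\,\langle \hat{x}, x \rangle
     + \lVert x \rVert^2 \\
  &= \left( \tfrac{\pi}{2}\,\alpha^2 - 2\alpha + 1 \right) \lVert x \rVert^2 , \\[4pt]
\frac{\partial}{\partial \alpha}:\quad \pi\alpha - 2 = 0
  &\;\Longrightarrow\; \alpha^\ast = \frac{2}{\pi} \approx 0.6366 , \\[4pt]
\mathbb{E}\,\lVert \alpha^\ast \hat{x} - x \rVert^2
  &= \left( 1 - \tfrac{2}{\pi} \right) \lVert x \rVert^2 \approx 0.363\,\lVert x \rVert^2 ,
\qquad
\mathbb{E}\,\lVert \hat{x} - x \rVert^2
  = \left( \tfrac{\pi}{2} - 1 \right) \lVert x \rVert^2 \approx 0.571\,\lVert x \rVert^2 .
\end{aligned}
```

So α = 1 keeps the estimator unbiased on inner products at the cost of a larger reconstruction error, while α* = 2/π minimizes reconstruction error and gives up unbiasedness.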
|
Thanks. Closed-form derivation is cleaner; using that. Working through:

1. Rename + default 1.0: yes.
2. Framing fix on PR body: correct, mine was wrong. Both estimators are unbiased on
3. Tests: adding the orthogonality contract (
Quick math check on
Which do you prefer?
4. 64K+ follow-up: noted, separate issue. The variance-accumulation mechanism behind buun's

Pushing the rename + default 1.0 + docstring + orthogonality test now, and updating the PR description. Will tag once code is up so you can pick the test option.
Per @TheTom's review on PR TheTom#93:
* Rename `damping` kwarg to `shrinkage` on QJL.dequantize and TurboQuant.dequantize. Default 1.0, the classical paper-faithful unbiased estimator. Existing callers without the kwarg get classical unbiased QJL (backward-compatible).
* Docstrings document `2/np.pi ≈ 0.6366` as the MMSE-optimal shrinkage with the one-line derivation from E[||x̂||²] = (π/2)·||x||² and E[⟨x̂, x⟩] = ||x||².
* Add the orthogonality contract:
  - `__init__` asserts `||S Sᵀ − I||_F < 1e-10` (cheap defensive check; required for the unbiased estimator).
  - New `TestQJLProjection.test_projection_matrix_is_orthogonal` at d ∈ {64, 128, 256, 512} with the tighter `< 1e-12` tolerance @TheTom suggested.

Verified: orthogonality error 5e-15 to 4e-14 across all tested d. Reconstruction norm ratio under default shrinkage=1.0 averages 1.253 = √(π/2) at every d, exactly matching the closed form.
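The shrinkage claim in this commit message can be checked with a small sweep. The sketch below is an assumption-laden stand-in, not the repository's test suite: `haar_orthogonal` and `relative_mse` are hypothetical helpers, the inputs are synthetic Gaussian vectors rather than real KV vectors, and the grid step of 0.01 is arbitrary.

```python
import numpy as np


def haar_orthogonal(d, rng):
    """Random orthogonal matrix from QR, with the usual diag(R) sign fix."""
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    return q * np.sign(np.diag(r))


def relative_mse(d, alphas, n_vectors=2000, seed=0):
    """Mean of ||alpha * x_hat - x||^2 / ||x||^2 for each alpha."""
    rng = np.random.default_rng(seed)
    S = haar_orthogonal(d, rng)
    X = rng.standard_normal((n_vectors, d))           # one vector per row
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Z = np.sign(X @ S.T)                              # rows are sign(S x)
    X_hat = np.sqrt(np.pi / 2) / np.sqrt(d) * norms * (Z @ S)  # rows are S^T z
    sq_norms = norms[:, 0] ** 2
    return np.array(
        [np.mean(np.sum((a * X_hat - X) ** 2, axis=1) / sq_norms) for a in alphas]
    )


alphas = np.linspace(0.5, 1.0, 51)
for d in (64, 128, 256, 512):
    mse = relative_mse(d, alphas)
    best = alphas[np.argmin(mse)]
    print(f"d={d}: best shrinkage on grid = {best:.2f}, "
          f"relative MSE at 1.0 = {mse[-1]:.3f}, at best = {mse.min():.3f}")
```

On this grid the minimizer should land next to 2/π, and the relative error at `shrinkage=1.0` should sit near π/2 − 1, in line with the closed form.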
|
Code pushed in
|
…nt for prod

Expand the §QJL note to state explicitly that production drops the QJL stage on both K and V. Name TheTom/llama-cpp-turboquant as the production path and document the recommended config (--cache-type-k q8_0 --cache-type-v turbo3). Keep the 5-group consensus citation and add guidance for downstream users: use TurboQuantMSE for V or straight PolarQuant for K. Only enable QJL classes for paper reproducibility or K-side research below 8-bit, and validate at target context length since QJL noise historically accumulates past ~16K.

Follow-up to #93 (QJL math fix). The reference impl is now correct, but the production guidance hadn't been said this plainly anywhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…JL fix

Pulls in 3 upstream commits since merge-base 1224fef:
- c46f6b9 docs(papers): block-selector sparse attention WIP log
- 0cb20bc fix(qjl): orthogonal projection + sqrt(d) scale (TheTom TheTom#93)
- 280b466 README: mark QJL as reference-only

Clean auto-merge. Only file touched by both sides was turboquant.py; upstream added a `shrinkage` kwarg to TurboQuant.dequantize that slots in alongside our V-norm/MSE accounting fix without conflict.

Our fork-local commits retained: V-norm in memory_stats, SeedSequence PRNG, MSE compressed_size_bits, QJL regression test, rotation tests, ruff config + CI drop, OutlierTurboQuant.calibrate, HIP/AMD NaN doc.

PR TheTom#91 (ship/pr-90-curated), TheTom's curated cherry-pick of 5 of these, remains open; once it merges to upstream/main we'll want to rebase/reset to drop redundant commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
What
The QJL stage in `turboquant/qjl.py` used a Gaussian random matrix `S ~ N(0,1)^(d×d)` for `sign(S · x)` with dequantization scale `sqrt(π/2) / d`. Both that pairing and the corrected orthogonal-S / `sqrt(π/2) / sqrt(d)` pairing are unbiased on the inner product `⟨x̂, y⟩`, so the QJL stage isn't "off." The actual mechanism is variance, not magnitude. On real LLM KV vectors with `head_dim=128`:
* Gaussian `S` introduces variance of order `||r||² / d` per reconstructed dimension. Attention accumulation over thousands of tokens drives this into word-loop degeneration ("genetic genetic genetic...") the moment QJL is enabled.
* Reconstruction norm is `≈ √(π/2)·||x||`. The `sqrt(π/2) / d` scale is the unbiased form for Gaussian `S`; `sqrt(π/2) / sqrt(d)` is the unbiased form for orthogonal `S`. The fix isn't a magnitude correction: it swaps the projection family and matches the scale to it so variance collapses.
Fix
* Build `S` as a sign-corrected random orthogonal matrix via QR. Orthogonal projections preserve norms and inner products exactly; reconstruction variance collapses sharply.
* Set the dequantization scale to `sqrt(π/2) / sqrt(d)` so the orthogonal-S estimator stays unbiased on `⟨x̂, y⟩`.
* Add a `shrinkage` parameter (default `1.0` = classical paper-faithful unbiased QJL). MMSE-optimal value is `2/np.pi ≈ 0.6366`; derivation in the docstring (from `E[||x̂||²] = (π/2)·||x||²` and `E[⟨x̂, x⟩] = ||x||²`).
* Assert the orthogonality contract in `__init__`: `||S Sᵀ − I||_F < 1e-10`. New `TestQJLProjection.test_projection_matrix_is_orthogonal` at `d ∈ {64, 128, 256, 512}` with the tighter `< 1e-12` tolerance.
Empirical impact (M1 Pro 16 GB, Qwen 4B, K5/V4 hybrid)
* Orthogonality of `S` holds to machine precision: `||S Sᵀ − I||_F ≈ 5e-15 to 4e-14` across `d ∈ {64, 128, 256, 512}`.
* Under default `shrinkage=1.0`, reconstruction norm ratio averages `1.253 = √(π/2)` at every tested `d`, exactly matching the closed-form `E[||x̂||²] = (π/2)·||x||²`.
Reproducer
Full benchmarks, logs and the post-mortem write-up: https://github.com/devYRPauli/turboquant-m1pro-evaluation
Note: the parallel `norm_correction` work landed in `PolarQuant` on `86bcbbe`. This PR is independent of that and lands on the same `main`.
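The bias-versus-variance framing in the What section can be probed numerically. The following is a rough sketch under stated assumptions, not part of the reproducer: `compare_projections` is a hypothetical helper, the vectors are synthetic unit vectors rather than LLM KV vectors, and a fresh projection matrix is drawn per trial so that mean and variance are taken over the randomness of `S`.

```python
import numpy as np


def compare_projections(d=64, trials=500, seed=0):
    """Estimate <x, y> via QJL with Gaussian S and with orthogonal S."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)
    y = 0.5 * x + rng.standard_normal(d)
    x /= np.linalg.norm(x)
    y /= np.linalg.norm(y)

    est_gauss = np.empty(trials)
    est_orth = np.empty(trials)
    for t in range(trials):
        G = rng.standard_normal((d, d))

        # Gaussian S paired with the sqrt(pi/2) / d scale (||x|| = 1 here).
        x_hat_g = (np.sqrt(np.pi / 2) / d) * (G.T @ np.sign(G @ x))
        est_gauss[t] = x_hat_g @ y

        # Orthogonal S (QR of the same draw) paired with sqrt(pi/2) / sqrt(d).
        Q, R = np.linalg.qr(G)
        Q = Q * np.sign(np.diag(R))
        x_hat_o = (np.sqrt(np.pi / 2) / np.sqrt(d)) * (Q.T @ np.sign(Q @ x))
        est_orth[t] = x_hat_o @ y

    return float(x @ y), est_gauss, est_orth


true_ip, est_gauss, est_orth = compare_projections()
print(f"true <x, y>          : {true_ip:.4f}")
print(f"Gaussian   mean / var: {est_gauss.mean():.4f} / {est_gauss.var():.5f}")
print(f"orthogonal mean / var: {est_orth.mean():.4f} / {est_orth.var():.5f}")
```

Both sample means should sit close to the true inner product, while the orthogonal pairing shows the smaller spread. The size of the gap on real KV vectors and over long contexts is what the linked benchmarks measure; this toy run does not establish it.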