Replies: 64 comments 184 replies
-
It is also something other vendors out there are championing, such as NVIDIA (KTVC). Article: https://venturebeat.com/orchestration/nvidia-shrinks-llm-memory-20x-without-changing-model-weights (more links within that reference). It would be great to hear from the developers what is ahead regarding such features!
-
I've got something going here: unixsysdev/llama-turboquant@16e93d5
PS: Closer to optimal.
-
Working TurboQuant Implementation Available
Memory layout:
-
I have a working implementation of TurboQuant as native KV cache types in llama.cpp with Metal GPU support. Repo: https://github.com/TheTom/turboquant_plus What's working:
Benchmarks (M5 Max 128GB):
Compression target is met. The speed gap is from the unoptimized WHT rotation (O(d^2) per block). Working on Hadamard rotation (O(d log d)) and fused flash attention dequant next.
Gotcha for anyone else implementing this: Metal JIT silently falls back to CPU if you …
Happy to collaborate with anyone else working on this.
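For anyone chasing the same speedup: the O(d log d) rotation is the standard iterative fast Walsh-Hadamard butterfly. A minimal NumPy sketch of it (not taken from the repo above; power-of-two d assumed):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: O(d log d) instead of the O(d^2)
    dense matrix multiply. Unnormalized; scale by 1/sqrt(d) for an
    orthonormal rotation. d must be a power of two."""
    x = np.array(x, dtype=np.float64)
    d = x.shape[0]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

# sanity check against the dense Sylvester-Hadamard matrix
H = np.array([[1.0]])
for _ in range(5):                       # d = 32
    H = np.block([[H, H], [H, -H]])
v = np.arange(32, dtype=np.float64)
assert np.allclose(fwht(v), H @ v)
```

The butterfly does d·log2(d) additions per vector, which is what closes the gap to a dense per-block rotation.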
-
Couldn't wait, so I spun something up; hopefully, it helps the final implementation. Feel free to cherry-pick :)
Working TurboQuant TQ3_0 implementation (CPU, both K+V cache) Branch: https://github.com/Aaryan-Kapoor/llama.cpp/tree/turboquant-tq3_0 Implements Algorithm 1 (TurboQuant_mse) from the paper as
Benchmarks (Qwen3.5-35B-A3B Q4_K_M, CPU, 4 threads):
Output is identical to f16 baseline on the 35B model at temperature 0. Quality degrades on very small models (0.6B) as expected - the paper's claims hold for reasonably-sized models. Usage:
Known limitations:
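For readers who haven't opened the paper: the core of an MSE-style TurboQuant pass is "rotate, then uniformly quantize with a per-vector scale". A minimal NumPy sketch of that idea; the grid and scale rule here are my own illustrative choices, not the branch's exact constants:

```python
import numpy as np

def turboquant_mse_sketch(v, bits=3, seed=0):
    """Illustrative MSE-style quantizer: random orthogonal rotation,
    then symmetric uniform rounding with one scale per vector.
    (Real kernels use a Hadamard-type rotation and packed storage.)"""
    rng = np.random.default_rng(seed)
    d = v.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random rotation
    r = Q @ v                                         # spreads outliers across dims
    qmax = 2 ** (bits - 1) - 1                        # e.g. 3-bit -> codes in [-3, 3]
    m = np.abs(r).max()
    scale = m / qmax if m > 0 else 1.0
    codes = np.clip(np.round(r / scale), -qmax, qmax).astype(np.int8)
    # dequant: scale back up and undo the rotation
    return Q.T @ (codes.astype(np.float64) * scale), codes, scale

x = np.random.default_rng(1).standard_normal(64)
xhat, codes, scale = turboquant_mse_sketch(x, bits=8)
print(np.linalg.norm(x - xhat) / np.linalg.norm(x))   # small relative error
```

The rotation is what makes a plain uniform grid work: it flattens per-channel outliers so one scale per vector is enough.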
-
Got CUDA + Flash Attention turbo3 working on RTX 5090. Ported @TheTom's Metal turbo3 kernels to CUDA with full Flash Attention support for both K and V.
Hardware: RTX 5090 32GB, CUDA 12.8, sm_120, WSL2 Ubuntu 24.04
NIAH: 6/6 exact retrieval
Qwen3.5-27B is a hybrid architecture — only 16 of 64 layers have KV cache (the GatedAttention layers): 16 layers × 4 KV heads × 256 head_dim.
What's implemented (15 files, 4 new + 11 modified): all dispatch paths: convert, set-rows, get-rows, cpy, MUL_MAT routing (turbo3 excluded from mmvq/mmq, routed through dequant-then-cuBLAS for MUL_MAT)
Build:
Known limitations:
-
Anyone working on a Vulkan backend?
-
https://github.com/spiritbuun/llama-cpp-turboquant-cuda
This is a fork of Tom's implementation with CUDA support. Results look promising, as per their Twitter account (spiritbuun).
-
So it's already in the main repo of llama.cpp?
-
Is no one else seeing the obvious here?
-
Engineering Findings from an 8-Model TurboQuant Benchmark
We independently implemented TurboQuant from scratch (Python/NumPy, 49 tests, distortion matches the paper within ±15%) and ran systematic benchmarks across 8 models from GPT-2 (124M) to Qwen2.5-7B (7.6B). Sharing findings that may be useful for the llama.cpp integration:
Finding 1: K/V Norm Disparity
The paper does not discuss this. Modern LLMs have dramatically different Key vs Value vector magnitudes:
Since quantization error scales with norm squared, K needs far more bits than V. The K/V ratio predicts the optimal bit budget:
Finding 2: MSE > Prod for Attention
The paper recommends TurboQuantProd (QJL residual) for Keys. Our tests show MSE for both K and V works better in practice:
QJL adds variance that softmax amplifies. Low variance (MSE) beats unbiasedness (Prod).
Finding 3: Outlier-Aware Mixed Precision
~5-20% of K channels (especially Layer 0) have 10-100x larger RMS than the median. Storing outlier channels at 8-bit and the rest at 3-bit:
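The mixed-precision idea in Finding 3 is easy to prototype. A sketch (my own simplification, with per-channel symmetric scales and the top-RMS channels promoted to 8-bit):

```python
import numpy as np

def quant_per_channel(K, bits):
    """Symmetric uniform quantization with one scale per channel;
    bits is an integer array, one entry per channel."""
    out = np.empty_like(K)
    for c in range(K.shape[1]):
        qmax = 2 ** (int(bits[c]) - 1) - 1
        s = np.abs(K[:, c]).max() / qmax
        s = s if s > 0 else 1.0
        out[:, c] = np.clip(np.round(K[:, c] / s), -qmax, qmax) * s
    return out

def outlier_aware(K, frac=0.1, hi=8, lo=3):
    """Promote the top `frac` of channels by RMS to `hi` bits, rest at `lo`."""
    rms = np.sqrt((K ** 2).mean(axis=0))
    bits = np.full(K.shape[1], lo)
    n_out = max(1, int(frac * K.shape[1]))
    bits[np.argsort(rms)[-n_out:]] = hi
    return quant_per_channel(K, bits)

rng = np.random.default_rng(0)
K = rng.standard_normal((256, 64))
K[:, 0] *= 100.0                       # one outlier channel, as in Finding 3
err = lambda A: np.linalg.norm(K - A) / np.linalg.norm(K)
print(err(quant_per_channel(K, np.full(64, 3))), err(outlier_aware(K)))
```

Because error scales with channel magnitude, upgrading the handful of large-RMS channels removes most of the total distortion for a fraction of a bit per element.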
Finding 4: Compressed Storage Verified
Actual memory savings: GPT-2 89% reduction, 9x compression, zero PPL impact.
Repo
Full implementation, benchmarks, and data: https://github.com/scos-lab/turboquant (~2,500 LOC Python, 49 tests, MIT license). Hope these findings help with the llama.cpp integration.
-
I've been working on extending unixsysdev's tq3_0 implementation with V cache support and flash attention. Repo here: https://github.com/animehacker/llama-turboquant
What this adds on top of unixsysdev's work:
- Normalization fix (1/32 → 1/√32 for the asymmetric K-side WHT)
- 72K context with tq3_0 K+V (4.57x compression)
Paper with implementation details: https://oliverchurch.com/turboquant-for-ggml-achieving-4.57x-kv-cache-compression-in-llama.cpp.html
-
Seems like this tq3 quantization works well. When could it be applied to model weights, to replace the useless -q3- models?
-
Update Mar 30th 2026: WHT + QJL + MSE is the solution!
In @AmesianX's implementation, PPL decreased after introducing QJL. At first I thought this was due to what @AmesianX commented, i.e., "The fix was using independent sign patterns for MSE WHT and QJL SRHT." Since the only difference is the WHT (Walsh-Hadamard Transform), I implemented another version replacing the random rotation with WHT (https://github.com/Arclabs001/YATQ/blob/main/turboquant_wht.py).
Test Setup
Perplexity Comparison (Random Rotation vs WHT)
Attention Score Metrics Comparison
Observations
Finally, why "random rotation" + QJL makes it worse but WHT + QJL makes it better is still a mystery to me, since the paper's authors say they used random rotation. (The following is inferred with Claude's help, so it may only partly explain things.)
Mar 28th
Hey everyone! I just finished reproducing TurboQuant (ICLR 2026) purely in torch. This repo supports real QJL by rewriting the whole attention and forward process for Qwen3 models. And I found this result independently: at the same bit budget, k-bit MSE is better than (k-1)-bit MSE + 1-bit QJL.
Repo link: https://github.com/arclabs001/YATQ
Background
TurboQuant proposes a clever way to quantize KV caches:
The paper claims QJL eliminates quantization bias, which sounds great in theory. So I implemented both stages and ran extensive tests.
The Surprising Part
QJL actually hurts performance in practice. Here's what I found on Qwen3-1.7B (4K context), top-1 token consistency rate drops:
MSE-only consistently wins on Top-1 token matching. The gap is huge at low bits and still noticeable at 8-bit.
What's Going On?
The theory says QJL = no bias. That's true! But here's the trade-off:
QJL eliminates bias but explodes variance. And for attention, variance is worse than bias! Why? Softmax is tolerant to a uniform bias: a constant shift of all scores cancels out in the normalization. But variance randomly perturbs each score, which messes up the Top-K ranking. So you get "unbiased" estimates that give you the wrong Top-1 token more often.
Another Thing: Keys and Values Both Don't Need QJL
I also tested whether V should use QJL. Short answer: nope.
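The bias-vs-variance point is easy to verify numerically; here is a toy demonstration (not from the YATQ repo) that softmax is invariant to a constant shift, while zero-mean noise flips the Top-1 token:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scores = rng.standard_normal(64)       # toy attention logits

# 1. Uniform bias: softmax(s + c) == softmax(s), ranking untouched.
assert np.allclose(softmax(scores), softmax(scores + 3.0))

# 2. Zero-mean ("unbiased") noise: Top-1 flips in a share of trials.
flips = sum(
    int((scores + 0.5 * rng.standard_normal(64)).argmax() != scores.argmax())
    for _ in range(1000)
)
print(f"top-1 flipped in {flips}/1000 noisy trials")
```

The noise scale 0.5 is arbitrary; the qualitative picture (shift-invariance vs ranking churn) holds for any nonzero variance relative to the top-two score gap.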
Values only do a weighted sum, so softmax naturally averages out per-vector errors. QJL wastes 1 bit on useless residual info.
My Takeaway
For KV cache quantization:
The implementation is open source if anyone wants to dig deeper or challenge these findings: https://github.com/arclabs001/YATQ Would love to hear thoughts from the community! Did I miss something? Are there scenarios where QJL actually shines?
-
Why not compress the weights? For small quants there are very few distinct values per 4/3-bit code (16 or 8), which means lots of equal values. Very simple encoding with bit strings easily reduces the model's size by half or more. It requires some computation to decompress, but that happens in cache and takes little time when inference is memory-throughput bound rather than compute bound, so there is spare time for decompression. Prompt processing will be a bit slower, but token generation could speed up twofold or more. Too big a leap to ignore.
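The claim is checkable with a back-of-the-envelope experiment. A sketch using zlib as a stand-in for the "simple bit-string encoding"; the skewed code distribution below is a synthetic assumption about what low-bit quantized weights look like, not measured from a real model:

```python
import numpy as np
import zlib

rng = np.random.default_rng(0)

# synthetic 4-bit weight codes with a skewed (low-entropy) distribution
p = 0.5 ** np.arange(1, 17)
p /= p.sum()                                  # ~2 bits of entropy per code
codes = rng.choice(16, size=1_000_000, p=p).astype(np.uint8)

# pack two 4-bit codes per byte, as a quantized tensor would be stored
packed = (codes[0::2] << 4 | codes[1::2]).tobytes()

ratio = len(packed) / len(zlib.compress(packed, 9))
print(f"{ratio:.2f}x on top of the packed 4-bit layout")
```

Whether this wins end to end depends on exactly the commenter's point: decompression must cost less than the memory bandwidth it saves during token generation.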
-
TurboQuant CUDA Optimized & Benched
Summary
Full optimized CUDA implementation of TurboQuant KV cache compression for llama.cpp, targeting NVIDIA GPUs (SM86+). All 4 turbo types implemented with native flash attention, parallel SET_ROWS encoding, and aggressive kernel optimizations. Validated across 4 GPUs spanning 3 architecture generations (Ampere, Ada, Blackwell).
Repo: Madreag/turbo3-cuda branch
Forked from TheTom/llama-cpp-turboquant (
Key results (RTX 5090, Qwen 3.5 27B Q6_K):
1. Decode Performance (RTX 5090, 27B Q6_K)
Measured with
2. Prefill Context Scaling (tok/s)
Prefill auto-dequants turbo→fp16 and uses MMA/TILE kernels. All types track q8_0 with negligible overhead.
3. Quality: Perplexity (wikitext-103, 50 chunks)
turbo3/turbo4 stay within 3% of q8_0 even at 32K. turbo2/turbo1.5 delta grows with context — inherent to 2-bit quantization (proven by a threshold control test: PPL is bit-identical at 1e-2 and 1e-6 thresholds).
Note: the PPL baseline differs between wikitext-103 (used here, q8_0=6.1825) and wikitext-2 (used in the README, q8_0=6.759) — this is expected from different evaluation corpora. The relative ranking is consistent (turbo4 < turbo3 < turbo2 < turbo1.5), though absolute deltas differ between corpora.
4. Quality: KL Divergence vs f16 (100 prompts, 27B Q6_K)
5. Sparse V Skip Optimization
Skips V dequantization for positions with negligible attention weights. A control test proves zero quality impact:
Skip rates (Qwen3-1.7B, direct attention weight measurement):
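The mechanism behind section 5 can be illustrated in a few lines; a toy model of the skip, where the threshold and the attention-weight distribution are illustrative assumptions rather than the kernels' actual values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_pos, d = 512, 64
logits = 3.0 * rng.standard_normal(n_pos)      # peaked attention, as in practice
w = np.exp(logits - logits.max())
w /= w.sum()
V = rng.standard_normal((n_pos, d))

exact = w @ V
keep = w >= 1e-4                               # skip V rows with negligible weight
approx = w[keep] @ V[keep]                     # only these rows need dequantizing

skip_rate = 1.0 - keep.mean()
rel_err = np.linalg.norm(exact - approx) / np.linalg.norm(exact)
print(f"skipped {skip_rate:.0%} of V rows, relative output error {rel_err:.2e}")
```

Because the skipped rows carry almost no softmax mass, the weighted sum barely moves even when a large fraction of V rows is never touched.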
6. Norm Correction (TheTom turbo3 + spiritbuun turbo4)
Rescales the reconstructed vector norm to match the original magnitude. Impact:
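The norm correction in section 6 costs one extra float per vector; a sketch of the idea (my own minimal version, not the kernels' exact code):

```python
import numpy as np

def quant_dequant(v, bits=3):
    """Coarse symmetric uniform quantizer (stand-in for a turbo type)."""
    qmax = 2 ** (bits - 1) - 1
    s = np.abs(v).max() / qmax
    return np.clip(np.round(v / s), -qmax, qmax) * s

rng = np.random.default_rng(0)
x = rng.standard_normal(128)

y = quant_dequant(x)
# norm correction: rescale the reconstruction to the stored original norm
y_corr = y * (np.linalg.norm(x) / np.linalg.norm(y))

assert np.isclose(np.linalg.norm(y_corr), np.linalg.norm(x))
print(np.linalg.norm(x - y) / np.linalg.norm(x),
      np.linalg.norm(x - y_corr) / np.linalg.norm(x))
```

Whether the rescale reduces error depends on the rounding noise being roughly orthogonal to the signal; the measurements above suggest it does in practice.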
7. Asymmetric K/V Quality Matrix (PPL ctx=512)
V type dominates PPL (columns vary more than rows). K=turbo3/V=turbo3 ≈ K=q8_0/V=turbo3.
8. Impact of CUDA Kernel Optimizations
Measured by comparing the base TurboQuant implementation (
RTX 5090 (27B Q6_K)
RTX 3090 (9B Q8_0)
RTX 4090M (9B Q8_0)
Pattern: short context is identical (weight-loading bound). Optimizations show up at 32K+ where KV bandwidth dominates: +13-69% improvement across 4 GPUs. The advantage grows with context depth (64K: +34-47%). turbo4 benefits most because its larger KV amplifies the unoptimized dequant cost.
PPL (3090): q8_0 identical (9.3731). turbo4/turbo3 within noise. Optimized turbo2 PPL = 9.61 vs base 9.84 (2.3% better).
9. Per-GPU Performance Details
RTX 3090 Ti (SM86, OC +2200 mem, Qwen 3.5 9B Q8_0)
Decode speed (
turbo2 beats q8_0 by 5.3% at 32K (81.58 vs 77.44). turbo2 64K = 72.79, where q8_0 OOMs.
PPL (base vs optimized): q8_0 identical (8.5249). turbo3 within noise (+0.2%).
NIAH (25 tests, 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%.
4090M (SM89, 16 GB, Qwen 3.5 9B Q8_0)
Speed in Section 8 above.
PPL (base vs optimized): q8_0 identical (9.3737). turbo3/turbo4 within noise.
NIAH (max_tokens=4000): q8_0=turbo3=100%, turbo2=95%, turbo1.5=50%.
10. Cross-GPU Validation (1,351+ iterations, zero failures)
11. NIAH (Needle-in-a-Haystack)
RTX 5090 (Qwen 3.5 9B Q8_0, max_tokens=4000):
Earlier testing data (q8_0=85%, turbo3=70%) was caused by the Qwen 3.5 thinking model exhausting max_tokens=1000. Raising it to 4000 resolves all token-exhaustion failures.
RTX 3090 Ti (25 tests 4K-64K, max_tokens=4000): q8_0=turbo3=turbo2=92%, turbo1.5=100%. With a sufficient token budget, all types converge — the remaining failures at 32K/64K depth 10% are model-specific, not turbo degradation.
RTX 3090 (20 tests 4K-32K, max_tokens=4000): q8_0=100%, turbo3=100%, turbo2=95% (1 failure at 32K/10%), turbo1.5=60%.
RTX 4090M (20 tests 4K-32K, max_tokens=4000): q8_0=100%, turbo3=100%, turbo2=95%, turbo1.5=50%. Previous 90% scores at max_tokens=2000 were token exhaustion — resolved with 4000. turbo1.5 improved from 35% to 50% but remains degraded (a real 2-bit quality issue).
12. Optimizations Applied
13. Limitations
14. Configuration Recommendations
15. Attribution
Repo: Madreag/turbo3-cuda branch
-
Amazing. As we move into the experimental phase with these kernels, I’ve been reflecting on how TurboQuant could serve as the 'structural glue' for an even broader efficiency stack. Would this combination make sense in the future?
Each of these pieces multiplies model performance considerably. If we eventually combine them all, we might be looking at a significant leap in local LLM feasibility. Would this potentially let us fit 70B+ models onto commodity hardware with, say, 16GB VRAM, without losing much speed or intelligence? Quick AI-assisted calculations lead me to 2000x performance improvements, which seems a bit mind-blowing. Am I oversimplifying the synergy? Would llama.cpp be able to run all this? Has anyone else explored this 'Full-Stack Efficiency' horizon?
-
Except that recurrent memory comes at a qualitative cost; I don't think moving away from transformers is necessary. We already support SSM/Hymba, and you can run LFM-style models today. Besides, continuous context can also be emulated externally via continuous tree batching + pruning outside the model boundary: for instance, see https://docs.lloyal.ai/learn/introduction, which uses KV seqIds for exactly that with static transformers.
-
TurboQuant v1.3.0 — Bulletproof head_dim Detection + All Reported Issues Fixed
Hi all, v1.3.0 addresses every issue reported in this thread.
> Build Notice: Previous v1.3.0 binaries had an incorrect CUDA architecture (SM 52 default) and missing CUDA runtime DLLs on Windows. All binaries have been pulled and are being rebuilt with SM 75/80/86/89/90/120 support. New binaries uploading soon. Apologies for the inconvenience.
Release: https://github.com/AmesianX/TurboQuant/releases/tag/v1.3.0
Issues Resolved
Key Change: head_dim Detection Completely Redesigned
The old code only read
New detection uses a 5-level priority cascade with cross-validation:
All signals are logged for diagnostics:
Critical Discovery: n_embd/n_head is WRONG for many models
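To make the failure mode concrete, here is a sketch of a priority-cascade detector over GGUF-style metadata. The key names and the cascade shape are my guesses at what such a fix looks like, not v1.3.0's actual code:

```python
def detect_head_dim(meta):
    """Illustrative priority cascade. Earlier signals win; the naive
    n_embd / n_head is only a last resort, because models like Gemma
    decouple head_dim from the embedding width."""
    for key in ("attention.key_length",       # explicit per-head K dim
                "rope.dimension_count"):      # often equals head_dim
        if meta.get(key):
            return meta[key]
    return meta["embedding_length"] // meta["attention.head_count"]

# Gemma-2-9B-like metadata: 16 heads of dim 256, but n_embd = 3584
meta = {"attention.key_length": 256,
        "rope.dimension_count": 256,
        "embedding_length": 3584,
        "attention.head_count": 16}

assert detect_head_dim(meta) == 256
assert meta["embedding_length"] // meta["attention.head_count"] == 224  # the naive (wrong) answer
```

The cross-validation step in the real cascade would additionally check that the independent signals agree before trusting any single one.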
The turbo4-K PPL explosion was not reproduced in our fork. Our turbo4-K gives PPL 6.73 (+7.5% vs F16's 6.26) on Qwen3-30B-A3B — completely normal. After reviewing @TheTom's turbo4-resurrection paper, the PPL explosion in his fork was caused by 7 kernel-level bugs (SET_ROWS turbo3/turbo4 packing mismatch, missing QJL steps, etc.), not head_dim misdetection. Our implementation was built independently and does not share these bugs.
Benchmark (v1.3.0, Qwen3-30B-A3B Q4_K_M, DGX Spark GB10)
@fritolays @TheTom @sztlink @modderBUG — would appreciate retesting with v1.3.0 on your hardware once builds are up. The standalone Windows build (
Release: https://github.com/AmesianX/TurboQuant/releases/tag/v1.3.0
-
v1.3.0 Benchmark: Qwen3.5-27B Distilled Model (Claude 4.6 Opus Reasoning)
@TheTom reported turbo4-K PPL 8.22 on "Qwen3.5-27B distill" — here is our retest with v1.3.0 (P1→P5 head_dim fix applied).
Model: Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled Q4_K_M
PPL + KV Memory + Speed
Key finding: tbqp3/tbq3 at 5.2x compression with only +1.1% PPL — essentially lossless at head_dim=256 with QJL. This confirms turbo4-K works correctly in our independent implementation. The PPL explosion @TheTom reported in his fork was traced to 7 kernel-level bugs (see his turbo4-resurrection paper), not head_dim misdetection. Our fork does not share these bugs.
Scientist Name Transliteration Test ("Pauli Test")
Korean has an official national standard (외래어 표기법, the loanword orthography rules) for transliterating foreign names — exactly ONE correct answer per name. This makes it an extremely sensitive test for attention precision.
Config:
4/6 exact, 2/6 minor variant. The two
Release: https://github.com/AmesianX/TurboQuant/releases/tag/v1.3.0
-
Test on Ada Lovelace (sm_89) + Large MoE (122B) Results
Tested TurboQuant on hardware and a model size not yet covered by the community, it seems. Built from @TheTom's fork (
Model: Qwen3.5-122B-A10B Q5_K_S (Unsloth imatrix, 86.4GB, 3 split GGUFs)
1. KV Cache Size (1x82K context, 12 attn layers)
Only 12 of 48 layers use KV cache (the rest are recurrent), so absolute sizes are smaller than a pure transformer, but compression ratios hold per-layer. 2. Decode Performance (2x L40S, 82K ctx)
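For anyone sanity-checking such numbers, the per-layer arithmetic is straightforward. A sketch with placeholder head counts (the 4 KV heads and head_dim 128 below are illustrative assumptions; the 12-of-48 attention layers are from the report above):

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    """K and V each store n_ctx vectors of n_kv_heads * head_dim per layer."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

ctx = 82_000
f16 = kv_cache_bytes(12, 4, 128, ctx, 2.0)        # 2 bytes per element
tq3 = kv_cache_bytes(12, 4, 128, ctx, 3.0 / 8)    # ~3 bits per element, ignoring scales
print(f"f16: {f16 / 2**30:.2f} GiB, turbo3-ish: {tq3 / 2**30:.2f} GiB, "
      f"{f16 / tq3:.1f}x smaller")
```

With only 12 attention layers the absolute sizes stay small, but the f16-to-3-bit ratio is fixed by the element widths alone.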
I used the exact same prompt for all tests.
3. Dual-Slot 2x82K — The Unlock
Our production config (q8_0, 1x82K + mmproj) left only 186 MiB free on the first GPU. turbo3 changes this:
TurboQuant rotation matrices initialized correctly at 128x128. Both slots verified with concurrent requests, no cross-contamination. 4. VRAM Summary
Notes
Happy to run additional tests if useful, just let me know: for example specific context lengths, turbo2 results, or anything else.
-
Tested the turboquant fork from @Madreag at 1766c91 on two vGPUs from
Tested the following combos:
turbo4/turbo4
-
TurboQuant v1.3.0 - Model Sweep
@AmesianX here is the new set of tests, thanks again for the fast updates.
Environment
Findings
v1.3.0 head_dim Detection
The new 5-level priority cascade (P1→P5) resolves all previous auto-detection failures. Phi-4 and DeepSeek, which failed in v1.2.0, now work correctly. GLM-4.7-Flash (head_dim=576) is correctly detected but unsupported, falling back to f16.
Qwen3.5 Turbo V Quality Issue
Qwen3.5 models output only question marks when using turbo types on both K and V. Tested on 27B Q3_K_M and confirmed on 9B Q8_0 — the issue is independent of the weight quant level. Turbo K + q8_0 V works correctly. Other Qwen models (Qwen2.5-14B) are unaffected.
Context vs Speed Tradeoff
Analysed across models where TurboQuant provides meaningful context gains beyond q4_0 (DeepSeek-R1-14B, Gemma-3-27B, Magistral-Small-2509, Magistry-24B, Qwen3.5-27B, Llama3.2-24B-A3B):
Best context:
Best balance:
Architecture Summary
Test Results
Click to view; caution, it's extensive...
-
@TheTom
-
Damn, getting the bug I was getting the other day that was patched upstream in
[create_node] invalid tensor: null buffer (id=94197996336480)
-
Hi,
I have been running tests across different forks on an RTX 4070 Ti with these parameters:
Qwen3.5-35B-A3B-UD-IQ2_XXS.gguf -ngl 99 -fa on --host 0.0.0.0 --port 8081 -np 1 -t 10 --top_p 0.95 --top_k 20 --min_p 0.0 --presence_penalty 0.0 --repeat_penalty 1.0 --temp .98 --jinja -ctk tbqp3 -ctv tbqp3 --ctx-size 130000 -fit off
My testing has been real agentic workload with long running sessions, results so far :
1. WORKS FINE | spiritbuun/llama-cpp-turboquant (https://github.com/spiritbuun)
2. WORKS FINE | Madreag/turbo3-cuda (https://github.com/Madreag)
3. https://github.com/AmesianX/TurboQuant/ - gets “/////////” after some 10k-20k tokens; tried with a smaller context, same thing. I did build for the specific CUDA platform (RTX 4070 Ti - CMAKE_CUDA_ARCHITECTURES=89-real).
BR
Quoting @AmesianX's earlier reply:
> @fritolays Follow-up: we tested tbq3_0/tbq3_0 on Qwen3.5-27B-UD-Q4_K_XL on our local build (SM 121 native, DGX Spark GB10) and cannot reproduce the "???" issue. Output is normal:
> Hello! I'm doing well, thanks for asking. How about you?
> This strongly suggests the issue is build-related (SM 52 vs native). Could you confirm whether you used the release binary or built from source? That will tell us if this is a CUDA arch problem or a real bug we need to fix.
-
@TheTom
-
Related but different direction: I posted a show-and-tell here on an adaptive TurboQuant + RotateKV K-cache path for Qwen2.5: calibration-driven per-layer / per-channel K compression, with asymmetric K/V support. A few numbers:
Still a proof of concept and slower on decode, but maybe relevant to people following this thread.
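The calibration-driven allocation can be sketched as greedy water-filling over layers. This is my own toy version of the idea (error model: uniform-quantization error scales as norm²/4^bits), not the linked repo's algorithm:

```python
import numpy as np

def allocate_bits(layer_rms, total_bits, b_min=2, b_max=8):
    """Greedily spend a global bit budget where the modeled error is
    largest; each extra bit quarters a layer's uniform-quant error."""
    n = len(layer_rms)
    bits = np.full(n, b_min, dtype=int)
    budget = total_bits - b_min * n
    err = np.asarray(layer_rms, dtype=float) ** 2 / 4.0 ** bits
    while budget > 0:
        i = int(np.argmax(err))
        if err[i] == -np.inf:          # everything is already at b_max
            break
        if bits[i] >= b_max:
            err[i] = -np.inf           # retire maxed-out layers
            continue
        bits[i] += 1
        err[i] /= 4.0
        budget -= 1
    return bits

# a calibration pass that found one high-sensitivity layer
print(allocate_bits([10.0, 1.0, 1.0, 1.0], total_bits=16))
```

The greedy loop naturally concentrates bits on high-RMS layers, which is the same intuition as the K/V and per-channel findings earlier in this thread.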
-
Ampere (RTX 3080 Ti) Benchmark Data — Mixed KV Parity, PR #36 Validation, PR #43 turbo4 Fix
@TheTom — you mentioned wanting CUDA mixed
Model: Qwen3.5 9B Q4_K_M (bartowski)
Mixed q8_0 × turbo — Speed (t/s)
Mixed q8_0 × turbo — PPL (wikitext-2, 16K context)
Takeaway: On Ampere, mixed
PR #36 (signalnine FA optimizations) — Ampere A/B
Tested
PPL: symmetric turbo3 shifted from 6.6245 to 6.6478 — consistent with auto-asymmetric overriding to q8_0 K (matches the q8_0/turbo3 baseline exactly). Auto-asymmetric is working correctly.
Takeaway: PR #36 gives +35% decode on RTX 5090 but only +1% on Ampere. No regression on any config. The shmem LUT and occupancy tuning are Blackwell-optimized — Ampere's smaller shared memory doesn't benefit as much. Safe to merge from an Ampere perspective.
PR #43 (Dubascudes turbo4 WHT dequant fix) — Validated
We previously reported turbo4 K+V producing PPL 6952 on the Llama arch (NeuralDaredevil 8B). Tested PR #43 @
turbo4/turbo4 speed: pp512=3970, pp16384=3707, tg128=101.8. The WHT/dense rotation mismatch in the C reference was indeed the root cause. turbo4 K+V is now functional on CUDA.
Hardware Note
RTX 3080 Ti (12GB, SM 86, Ampere, PCIe Gen 3 x16). This is representative of the 30-series/40-series consumer cards most people are actually running. Happy to run additional configs if useful.
-
Hi, sharing task-accuracy benchmarks as a complement to PPL, since, as @TheTom noted, PPL and generation quality can diverge. We ran a KV cache recall benchmark on Qwen3-14B (no-think mode) using the spiritbuun CUDA fork, testing whether quantization corrupts recall of information stored earlier in the context. The method: place a math problem at position 0, fill with N tokens of unrelated text, then ask the model to recall and solve it. 52 tests: 2×2 matrix mult, 3×3 matrix mult, scalar arithmetic — at 0/200/500/1000-token filler distances.
Key findings:
The benchmark script is open source: https://github.com/eullm/eullm/blob/main/bench/turboquant_math_accuracy.py
Tested on EULLM Engine (an Ollama-compatible Rust inference server) wrapping the spiritbuun CUDA fork. ctx=16384, temperature=0, num_predict=2048.
-
Google Research just posted a blog and paper about a new algorithm that allows quantizing the KV cache down to under 3 bits with close to 0 accuracy loss.
Blog: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
Paper: https://arxiv.org/pdf/2504.19874
This could be huge if their claims are true, and MLX developers are already jumping on it:
https://x.com/Prince_Canuma/status/2036611007523512397
Thought I'd share the news here to see if llama.cpp developers would be interested in adding this feature.