feat: turbo3/turbo2 mixed KV cache type support (CUDA) #29
Closed
seanrasch wants to merge 1 commit into TheTom:feature/turboquant-kv-cache from
Conversation
Enable asymmetric KV cache precision with --cache-type-k turbo3
--cache-type-v turbo2. K cache uses 3-bit turbo3 (more sensitive —
affects shared softmax denominator), V cache uses 2-bit turbo2 (less
sensitive — contributes linearly, weighted by attention).
Compression: 5.33x vs f16 (up from 4.57x with turbo3/turbo3).
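The quoted ratios are mutually consistent if we assume roughly 3.5 effective bits/element for turbo3 and 2.5 for turbo2 (payload plus per-block scale overhead; these bit counts are an assumption, not taken from the source). A minimal arithmetic check:

```python
# Assumed effective bits per cached element, including scale overhead
# (hypothetical values chosen to match the quoted ratios).
F16_BITS = 16.0
K_BPW = 3.5   # turbo3: 3-bit payload + assumed overhead
V_BPW = 2.5   # turbo2: 2-bit payload + assumed overhead

avg_mixed = (K_BPW + V_BPW) / 2   # K and V caches are the same size
avg_sym   = K_BPW                  # turbo3/turbo3 baseline

print(round(F16_BITS / avg_mixed, 2))   # compression vs f16 -> 5.33
print(round(F16_BITS / avg_sym, 2))     # symmetric baseline -> 4.57
print(round(1 - avg_mixed / avg_sym, 3))  # KV memory saved vs turbo3/turbo3 -> 0.143
```

Under these assumed bit widths the numbers reproduce the 5.33x, 4.57x, and 14.3% figures quoted in the PR.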
Benchmarked on Qwen3 8B Q4_K_M, RTX 3080 Ti (SM 86):
Quality (wikitext-2 PPL):
f16/f16: 10.44
turbo3/turbo3: 11.57 (+1.13)
turbo3/turbo2: 11.88 (+1.44, only +0.31 over turbo3/turbo3)
Throughput:
turbo3/turbo3 pp32768: 2569 t/s, tg128: 113.9 t/s
turbo3/turbo2 pp32768: 2583 t/s, tg128: 112.8 t/s
K/V asymmetric precision is well-established in the literature (KIVI,
QAQ, KVTuner) — V is structurally less sensitive because each V vector
is consumed linearly in proportion to its attention weight, while K
vectors affect the global softmax distribution.
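The asymmetry argument above can be seen in a toy single-query attention with scalar keys and values (all numbers illustrative, not from the PR): a perturbation on one V entry changes the output by exactly its attention weight times the perturbation, while the same perturbation on a K entry shifts every softmax weight.

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

# Toy attention: one query, three scalar key/value pairs (illustration only).
q = 1.0
k = [2.0, 0.5, -1.0]
v = [1.0, 2.0, 3.0]

w = softmax([q * ki for ki in k])
out = sum(wi * vi for wi, vi in zip(w, v))

# Perturb a low-attention V entry: output error is exactly w[2] * eps,
# so positions with near-zero attention tolerate coarse V quantization.
eps = 0.5
v_err = [v[0], v[1], v[2] + eps]
out_v = sum(wi * vi for wi, vi in zip(w, v_err))

# Perturb the same K entry: the softmax renormalizes, so every weight moves.
k_err = [k[0], k[1], k[2] + eps]
w2 = softmax([q * ki for ki in k_err])
out_k = sum(wi * vi for wi, vi in zip(w2, v))
```

Here `out_v - out` equals `w[2] * eps` (small when position 2 gets little attention), whereas the K perturbation changes all three weights, which is why K is quantized more conservatively.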
Changes:
- Add fattn-vec template instances for turbo3_0/turbo2_0 combinations
- Add extern declarations in fattn-vec.cuh
- Add dispatch cases in fattn.cu (vec kernel + mixed-type validation)
- Add source files to CMakeLists.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
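The "dispatch cases + mixed-type validation" change could look roughly like the following sketch. All names here (`SUPPORTED`, `select_fattn_vec_kernel`, the set of allowed pairs) are hypothetical stand-ins for the actual fattn.cu dispatch logic, which is not shown in this PR page:

```python
# Hypothetical sketch of (K-type, V-type) kernel dispatch with validation.
# The real code dispatches C++ template instances in fattn.cu.
SUPPORTED = {
    ("f16", "f16"),
    ("turbo3_0", "turbo3_0"),
    ("turbo3_0", "turbo2_0"),  # new asymmetric combination from this PR
    ("turbo2_0", "turbo3_0"),  # reverse instance also added
}

def select_fattn_vec_kernel(type_k: str, type_v: str) -> str:
    """Return a kernel name for a supported K/V type pair, else raise."""
    if (type_k, type_v) not in SUPPORTED:
        raise ValueError(f"unsupported KV cache combination: {type_k}/{type_v}")
    return f"fattn_vec_{type_k}_{type_v}"
```

The key point is that mixed-type pairs must be explicitly instantiated and validated; an unlisted combination fails fast rather than silently selecting a mismatched kernel.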
vonempalmeolmos pushed a commit to vonempalmeolmos/llama-cpp-turboquant that referenced this pull request on Mar 29, 2026

Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims
Fixed TheTom#3 (TURBO_D). TheTom#1 and TheTom#2 don't affect turbo3+dk128 path.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
vonempalmeolmos pushed a commit to vonempalmeolmos/llama-cpp-turboquant that referenced this pull request on Mar 29, 2026

…ling (Issue TheTom#29)

Three bugs from the block-size-32 refactor:
1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4 — split into separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels. turbo4 now correctly does 3-bit PolarQuant + QJL residual correction.
2. Integer division in n_groups = nk0 / blocks_per_group silently dropped tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling division with tail-group bounds checking in turbo3, and GGML_ASSERT in WHT dispatch to catch non-128-aligned tensors.
3. TURBO_D constant was semantically coupled to QK_TURBO4 — replaced with TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added static_assert that QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift.
Closes TheTom#29
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
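Bug 2 above (floor division dropping tail blocks) is easy to reproduce in miniature. The constants below are illustrative (`blocks_per_group = 4`, i.e. a 128-element group of 32-element blocks, matching the dk=192 example which yields 6 blocks):

```python
# Sketch of the tail-block truncation bug and its ceiling-division fix.
blocks_per_group = 4  # hypothetical: 128-element group / 32-element blocks

def n_groups_floor(nk0: int) -> int:
    # Buggy version: integer division silently drops a partial tail group.
    return nk0 // blocks_per_group

def n_groups_ceil(nk0: int) -> int:
    # Fixed version: ceiling division; the tail group then needs bounds checks.
    return (nk0 + blocks_per_group - 1) // blocks_per_group

nk0 = 6  # e.g. head dim 192 -> 6 blocks of 32, not a multiple of 4
print(n_groups_floor(nk0))  # 1 -> last 2 blocks silently dropped
print(n_groups_ceil(nk0))   # 2 -> tail group processed (with bounds checking)
```

For 128-aligned head dims both versions agree, which is why the turbo3+dk128 path was unaffected.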
vonempalmeolmos pushed a commit to vonempalmeolmos/llama-cpp-turboquant that referenced this pull request on Mar 29, 2026

fix: turbo4 SET_ROWS, tail-block truncation, constant coupling, stack overflow (Issue TheTom#29)
seanrasch added a commit to seanrasch/llama-cpp-turboquant that referenced this pull request on Mar 31, 2026

1. turbo_init_rotation() allocated float G[128*128] (64KB) on the stack then memcpy'd into the static turbo_rotation array. This segfaults on llama.cpp worker threads with reduced stack sizes (512KB macOS, 64KB some Linux). Fix: generate the Gaussian matrix directly into turbo_rotation, eliminating both the stack allocation and the memcpy.
2. TURBO_D and QK_TURBO3_GROUP are defined separately but must always match (both represent the rotation group size). Add static_assert to catch silent divergence between CPU reference and GPU kernels.
Fixes: TheTom#29 (remaining items from PR TheTom#18 review)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
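The fix's shape (fill the static target buffer directly instead of building a large temporary and copying) can be sketched as follows. This is a Python illustration of the pattern only; the real code is C, the dimension is 128, and the Gaussian generator used there is not specified here, so Box-Muller and the small dimension are assumptions:

```python
import math
import random

TURBO_ROT_DIM = 8  # illustrative; the real rotation matrix is 128x128

# Pre-existing target buffer (static array in the C code).
turbo_rotation = [0.0] * (TURBO_ROT_DIM * TURBO_ROT_DIM)

def init_rotation_in_place(buf, seed=0):
    """Write Gaussian samples directly into buf (Box-Muller, assumed),
    avoiding any large temporary buffer and the follow-up copy."""
    rng = random.Random(seed)
    for i in range(0, len(buf), 2):
        u1 = rng.random() or 1e-12  # guard against log(0)
        u2 = rng.random()
        r = math.sqrt(-2.0 * math.log(u1))
        buf[i] = r * math.cos(2.0 * math.pi * u2)
        if i + 1 < len(buf):
            buf[i + 1] = r * math.sin(2.0 * math.pi * u2)

init_rotation_in_place(turbo_rotation)
```

In the C version the point is stack safety: a 64KB `float G[128*128]` temporary overflows a 64KB worker-thread stack before the copy even happens, while writing into the static array uses no stack at all.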
Author
Closing — this is now fully covered by the merged work on
Thanks for merging these upstream — no action needed here.
mihai-chiorean pushed a commit to mihai-chiorean/turbo3-cuda that referenced this pull request on Mar 31, 2026

Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims
Fixed TheTom#3 (TURBO_D). #1 and TheTom#2 don't affect turbo3+dk128 path.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom added a commit that referenced this pull request on Apr 2, 2026

Codex post-commit review found:
1. TURBO_D was QK_TURBO3 (now 32) — broke turbo4 C array sizes
2. SET_ROWS kernel turbo3-specific but instantiated for turbo4
3. Tail block drop for non-128 head dims
Fixed #3 (TURBO_D). #1 and #2 don't affect turbo3+dk128 path.
Co-Authored-By: tturney@psyguard.ai
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
TheTom pushed a commit that referenced this pull request on Apr 2, 2026

…ling (Issue #29)

Three bugs from the block-size-32 refactor:
1. kernel_set_rows_turbo hardcoded turbo3 packing for turbo4 — split into separate kernel_set_rows_turbo3 and kernel_set_rows_turbo4 kernels. turbo4 now correctly does 3-bit PolarQuant + QJL residual correction.
2. Integer division in n_groups = nk0 / blocks_per_group silently dropped tail blocks for non-128-aligned head dims (e.g. dk=192). Added ceiling division with tail-group bounds checking in turbo3, and GGML_ASSERT in WHT dispatch to catch non-128-aligned tensors.
3. TURBO_D constant was semantically coupled to QK_TURBO4 — replaced with TURBO_ROT_DIM (= QK_TURBO3_GROUP) and added static_assert that QK_TURBO4 == QK_TURBO3_GROUP to guard against future drift.
Closes #29
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Enable asymmetric KV cache precision: turbo3 K + turbo2 V.
--cache-type-k turbo3 --cache-type-v turbo2

K/V asymmetric precision is well-established — KIVI, QAQ, and KVTuner all demonstrate
that V tolerates lower precision than K. Each V vector is consumed in proportion to
its attention weight; at long context most V positions receive near-zero attention
(this is why sparse V dequant works). Compressing them further costs almost nothing.
Benchmarks (RTX 3080 Ti, SM 86)
Quality (wikitext-2 PPL)
Validated across two architectures (Qwen, Llama) and two weight quants (Q4_K_M, Q8_0):
turbo3/turbo2 adds +0.26 to +0.31 PPL over turbo3/turbo3 — consistent across
models and architectures. This is 22-27% of the cost already accepted for turbo3
compression.
Throughput (3 runs each)
Qwen3 8B (Q4_K_M):
Qwen3.5 9B (Q4_K_M):
NeuralDaredevil 8B (Q8_0):
Throughput impact ranges from -2.6% to +0.5% on prefill, -1% to +0.3% on decode.
At 32K context (where the memory savings matter most) all three models show neutral
or positive throughput.
Memory
14.3% less KV memory than turbo3/turbo3.
Changes
- fattn-vec-instance-turbo3_0-turbo2_0.cu and turbo2_0-turbo3_0.cu template instances
- fattn-vec.cuh
- fattn.cu
- CMakeLists.txt

5 files changed, +38 lines. No algorithmic changes. All existing code paths unchanged.
Test plan
--cache-type-k turbo3 --cache-type-v turbo2 runs without errors

🤖 Generated with Claude Code