feat: Vulkan compute shader support for turbo3 (experimental) #33
Open
apollosenvy wants to merge 1 commit into TheTom:feature/turboquant-kv-cache
Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
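For reviewers who want a CPU-side reference for the index reconstruction described in the commit message, here is a minimal C++ sketch. It is not the PR's shader code. It assumes a 32-element block (8 bytes of 2-bit `qs` plus 4 bytes of 1-bit `signs`), little-endian bit packing within each byte, the sign bit as the high bit of the 3-bit index, and a `float` norm where the real struct presumably uses half precision. `BlockTurbo3Sketch`, `kCentroids`, and the `turbo3_*` helpers are hypothetical names, and the centroid values are placeholders rather than the table used by turbo3.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 8 bytes * 4 two-bit fields = 32 elements per block.
constexpr int TURBO3_BLOCK = 32;

// Hypothetical mirror of the fields named in the commit message
// (norm + qs[8] + signs[4]); not the actual block_turbo3_0 definition.
struct BlockTurbo3Sketch {
    float   norm;      // per-block scale (assumed; likely fp16 in the shader)
    uint8_t qs[8];     // low 2 index bits, 4 elements per byte
    uint8_t signs[4];  // high index bit, 8 elements per byte
};

// Placeholder centroid table, NOT the values used by the PR.
constexpr std::array<float, 8> kCentroids = {
    -1.75f, -1.25f, -0.75f, -0.25f, 0.25f, 0.75f, 1.25f, 1.75f};

// Rebuild the 3-bit centroid index of element i from its 2-bit and 1-bit parts.
inline int turbo3_index(const BlockTurbo3Sketch& b, int i) {
    assert(i >= 0 && i < TURBO3_BLOCK);
    const int lo = (b.qs[i / 4] >> (2 * (i % 4))) & 0x3;
    const int hi = (b.signs[i / 8] >> (i % 8)) & 0x1;
    return lo | (hi << 2);
}

// Dequantize element i: centroid lookup scaled by the block norm.
inline float turbo3_dequant(const BlockTurbo3Sketch& b, int i) {
    return b.norm * kCentroids[static_cast<std::size_t>(turbo3_index(b, i))];
}

// Store a 3-bit index for element i (inverse of turbo3_index), for testing.
inline void turbo3_set_index(BlockTurbo3Sketch& b, int i, int idx) {
    assert(i >= 0 && i < TURBO3_BLOCK && idx >= 0 && idx < 8);
    const int qshift = 2 * (i % 4);
    const int sshift = i % 8;
    b.qs[i / 4] = static_cast<uint8_t>(
        (b.qs[i / 4] & ~(0x3 << qshift)) | ((idx & 0x3) << qshift));
    b.signs[i / 8] = static_cast<uint8_t>(
        (b.signs[i / 8] & ~(0x1 << sshift)) | (((idx >> 2) & 0x1) << sshift));
}
```

A round trip through `turbo3_set_index` and `turbo3_index` is a cheap way to check a packing convention like this against the shader output before comparing full dequantized rows.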
terrysimons pushed a commit to terrysimons/llama-cpp-turboquant that referenced this pull request on Mar 31, 2026
Half-precision centroid table in vec flash attention dequant. Reduces constant cache pressure at high access volumes.

Decode improvements:
- Short: 75.3 → 77.2 (+2.5%)
- 8K: 59.2 → 67.3 (+13.7%)
- 48K (Mario PDF): 36.7 → 39.0 (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
PGCRT pushed a commit to PGCRT/llama-cpp-turboquant-cuda that referenced this pull request on Apr 1, 2026
TheTom added a commit that referenced this pull request on Apr 2, 2026
Summary
Vulkan backend support for turbo3 KV cache quantization. Experimental -- works on AMD 7900 XTX via RADV.
New files:
- dequant_turbo3_0.comp -- standalone dequant shader (3-bit index from 2-bit qs + 1-bit signs)
- dequant_funcs.glsl -- inline dequant/dequant4/get_dm for get_rows/mul_mat
- dequant_funcs_cm2.glsl -- cooperative matrix 2 FA path
- copy_to_quant.comp -- quantize with norm correction
- types.glsl -- block_turbo3_0 struct
- vulkan-shaders-gen.cpp -- turbo3_0 type registration
- ggml-vulkan.cpp -- pipeline creation + supports_op

Benchmark (AMD 7900 XTX, RADV, Vulkan)
Token generation (tg) is already slightly faster than with ROCm HIP (25.8 vs. 25.2 t/s). The prompt processing (pp) gap comes from the standalone dequant path; inline FA dequant would close it.
Status
Marked experimental. Tested on RADV only.
🤖 Generated with Claude Code