feat: Vulkan compute shader support for turbo3 (experimental) by apollosenvy · Pull Request #33 · TheTom/llama-cpp-turboquant

apollosenvy · 2026-03-31T02:02:22Z

Summary

Vulkan backend support for turbo3 KV cache quantization. Experimental -- works on AMD 7900 XTX via RADV.

New files:

dequant_turbo3_0.comp -- standalone dequant shader (3-bit index from 2-bit qs + 1-bit signs)
dequant_funcs.glsl -- inline dequant/dequant4/get_dm for get_rows/mul_mat
dequant_funcs_cm2.glsl -- cooperative matrix 2 FA path
copy_to_quant.comp -- quantize with norm correction
types.glsl -- block_turbo3_0 struct
vulkan-shaders-gen.cpp -- turbo3_0 type registration
ggml-vulkan.cpp -- pipeline creation + supports_op

Benchmark (AMD 7900 XTX, RADV, Vulkan)

Test	F16 KV	turbo3 KV	Ratio
pp128	748	264	35%
tg32	36.0	27.4	76%
tg128	33.3	26.4	79%

The tg numbers are already faster than ROCm HIP (25.2 t/s). The pp gap is from the standalone dequant path -- inline FA dequant would close it.

Status

Quantize/dequant: working
get_rows: working
set_rows: working
Flash attention: works via dequant-to-F16 path (not inline turbo3 FA)
coopmat2 FA: shader compiles, untested on hardware

Marked experimental. Tested on RADV only.

🤖 Generated with Claude Code

Full turbo3 quantize/dequant pipeline for Vulkan backend: - types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4]) - dequant_turbo3_0.comp: standalone dequant shader (3-bit index reconstruction from 2-bit qs + 1-bit signs, centroid lookup) - dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths - dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support - copy_to_quant.comp: quantize function with norm correction - vulkan-shaders-gen.cpp: turbo3_0 type registration - ggml-vulkan.cpp: pipeline creation and supports_op dispatch Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Half-precision centroid table in vec flash attention dequant. Reduces constant cache pressure at high access volumes. Decode improvements: Short: 75.3 → 77.2 (+2.5%) 8K: 59.2 → 67.3 (+13.7%) 48K (Mario PDF): 36.7 → 39.0 (+6.3%) PPL: unchanged (6.211) Prefill: no regression Fixes TheTom#33 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

Half-precision centroid table in vec flash attention dequant. Reduces constant cache pressure at high access volumes. Decode improvements: Short: 75.3 → 77.2 (+2.5%) 8K: 59.2 → 67.3 (+13.7%) 48K (Mario PDF): 36.7 → 39.0 (+6.3%) PPL: unchanged (6.211) Prefill: no regression Fixes #33 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: tturney@psyguard.ai

github-actions bot added ggml Vulkan labels Mar 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Vulkan compute shader support for turbo3 (experimental)#33

feat: Vulkan compute shader support for turbo3 (experimental)#33
apollosenvy wants to merge 1 commit intoTheTom:feature/turboquant-kv-cachefrom
apollosenvy:pr/vulkan-turbo3

apollosenvy commented Mar 31, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

apollosenvy commented Mar 31, 2026

Summary

Benchmark (AMD 7900 XTX, RADV, Vulkan)

Status

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants