
feat: Vulkan compute shader support for turbo3 (experimental) #33

Open
apollosenvy wants to merge 1 commit into TheTom:feature/turboquant-kv-cache from apollosenvy:pr/vulkan-turbo3

Conversation

@apollosenvy

Summary

Adds Vulkan backend support for turbo3 KV cache quantization. Experimental -- verified working on an AMD 7900 XTX via RADV.

New files:

  • dequant_turbo3_0.comp -- standalone dequant shader (3-bit index from 2-bit qs + 1-bit signs)
  • dequant_funcs.glsl -- inline dequant/dequant4/get_dm for get_rows/mul_mat
  • dequant_funcs_cm2.glsl -- cooperative matrix 2 FA path
  • copy_to_quant.comp -- quantize with norm correction
  • types.glsl -- block_turbo3_0 struct
  • vulkan-shaders-gen.cpp -- turbo3_0 type registration
  • ggml-vulkan.cpp -- pipeline creation + supports_op
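For reference, the 2-bit-plus-sign reconstruction done by dequant_turbo3_0.comp can be sketched in C roughly as follows. The struct fields follow the description above (norm + qs[8] + signs[4] per 32-element block); the centroid values, the bit-packing order, and storing norm as float instead of f16 are all assumptions for illustration, not the shader's actual implementation:

```c
#include <stdint.h>

/* Hypothetical block layout: 32 elements per block,
 * matching "norm + qs[8] + signs[4]" above. */
typedef struct {
    float   norm;      /* per-block scale (f16 in the actual shader) */
    uint8_t qs[8];     /* 32 x 2-bit magnitude codes */
    uint8_t signs[4];  /* 32 x 1-bit signs */
} block_turbo3_0;

/* Placeholder 8-entry table indexed by the reconstructed 3-bit code;
 * the real centroid values are not shown in this PR description. */
static const float centroids[8] = {
     0.10f,  0.35f,  0.60f,  1.00f,  /* sign bit 0 (hypothetical) */
    -0.10f, -0.35f, -0.60f, -1.00f   /* sign bit 1 (hypothetical) */
};

/* Dequantize element i: 3-bit index = (sign << 2) | 2-bit code,
 * then look up the centroid and apply the per-block norm. */
static float dequant_turbo3_0(const block_turbo3_0 *b, int i) {
    uint8_t q    = (uint8_t)((b->qs[i / 4] >> (2 * (i % 4))) & 0x3);
    uint8_t sign = (uint8_t)((b->signs[i / 8] >> (i % 8)) & 0x1);
    return b->norm * centroids[(sign << 2) | q];
}
```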

Benchmark (AMD 7900 XTX, RADV, Vulkan)

Test    F16 KV (t/s)   turbo3 KV (t/s)   Ratio
pp128   748            264               35%
tg32    36.0           27.4              76%
tg128   33.3           26.4              79%

The tg numbers are already faster than ROCm HIP (25.2 t/s). The pp gap is from the standalone dequant path -- inline FA dequant would close it.

Status

  • Quantize/dequant: working
  • get_rows: working
  • set_rows: working
  • Flash attention: works via dequant-to-F16 path (not inline turbo3 FA)
  • coopmat2 FA: shader compiles, untested on hardware

Marked experimental. Tested on RADV only.
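The "quantize with norm correction" step in copy_to_quant.comp can be illustrated with a small C sketch. The correction shown here -- nearest-centroid code assignment followed by a least-squares refit of the per-block norm -- is a guess at what the correction does; the shader's actual scheme may differ, and the centroid table is a placeholder:

```c
#include <stdint.h>
#include <math.h>

/* Placeholder centroid table, same hypothetical values as above. */
static const float centroids[8] = {
     0.10f,  0.35f,  0.60f,  1.00f,
    -0.10f, -0.35f, -0.60f, -1.00f
};

/* Assign each of n values its nearest 3-bit centroid code, then refit
 * the block norm so norm * centroids[code] best matches x in the
 * least-squares sense. Returns the corrected norm. */
static float quantize_with_norm_correction(const float *x,
                                           uint8_t *codes, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++)
        if (fabsf(x[i]) > amax) amax = fabsf(x[i]);
    float norm = amax > 0.0f ? amax : 1.0f;  /* initial scale */

    float num = 0.0f, den = 0.0f;
    for (int i = 0; i < n; i++) {
        int best = 0;
        float bestd = INFINITY;
        for (int k = 0; k < 8; k++) {
            float d = fabsf(x[i] / norm - centroids[k]);
            if (d < bestd) { bestd = d; best = k; }
        }
        codes[i] = (uint8_t)best;
        num += x[i] * centroids[best];           /* refit numerator   */
        den += centroids[best] * centroids[best]; /* refit denominator */
    }
    return den > 0.0f ? num / den : norm;  /* corrected norm */
}
```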

🤖 Generated with Claude Code

Full turbo3 quantize/dequant pipeline for Vulkan backend:

- types.glsl: block_turbo3_0 struct (norm + qs[8] + signs[4])
- dequant_turbo3_0.comp: standalone dequant shader (3-bit index
  reconstruction from 2-bit qs + 1-bit signs, centroid lookup)
- dequant_funcs.glsl: inline dequant for get_rows/mul_mat paths
- dequant_funcs_cm2.glsl: cooperative matrix 2 FA path support
- copy_to_quant.comp: quantize function with norm correction
- vulkan-shaders-gen.cpp: turbo3_0 type registration
- ggml-vulkan.cpp: pipeline creation and supports_op dispatch

Tested on AMD 7900 XTX (RADV): 243 pp / 25.8 tg t/s with turbo3 KV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
terrysimons pushed a commit to terrysimons/llama-cpp-turboquant that referenced this pull request Mar 31, 2026
Half-precision centroid table in vec flash attention dequant.
Reduces constant cache pressure at high access volumes.

Decode improvements:
  Short: 75.3 → 77.2 (+2.5%)
  8K: 59.2 → 67.3 (+13.7%)
  48K (Mario PDF): 36.7 → 39.0 (+6.3%)

PPL: unchanged (6.211)
Prefill: no regression

Fixes TheTom#33

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: tturney@psyguard.ai
PGCRT pushed a commit to PGCRT/llama-cpp-turboquant-cuda that referenced this pull request Apr 1, 2026
TheTom added a commit that referenced this pull request Apr 2, 2026